ABSTRACT


Parallelism in microprogramming systems is inves- 
tigated here with respect to the following problems: 
(a) The identification of parallel micro-operations 
in straight-line microprograms. Earlier solutions to this
problem include algorithms which, while fairly general, 
do not guarantee optimal output; there are also several 
other algorithms, which attempt to optimize the output 
but are restricted in their applicability. The analysis 
of straight-line microprograms is extended in this thesis 
and a new, general, optimizing algorithm is presented. 
(b) Identification of parallel micro-operations in 
loop-free microprograms. This is the problem of "global" 
parallelism (in contrast to that of "local" parallelism 
referred to in (a) above) and its analysis here within a 
graph-theoretic framework leads to a method of detecting 
"globally parallel" micro-operations. Global analysis 
may - though not necessarily - produce more optimal micro- 
code than that produced by local analysis alone. Thus, 
it becomes an important strategy in designing architec- 
tures, when executional time efficiency is the main 
objective. 
(c) Since one cannot guarantee that mechanical procedures
will produce optimal microcode in an arbitrary
microprogram, it seems desirable that a micro-software
system should give the microprogrammer the choice as to
whether optimization is to be performed mechanically or
by the programmer. This consideration gives rise to the
problem of developing language constructs for expressing
horizontal (i.e. "parallel") microprograms explicitly.

A solution to this problem is the third major result of 
this study: language constructs are proposed and their 
semantic features discussed. These constructs not only 
allow the expression of micro-parallelism, but also 
enable microprogram verification rules to be established 
analogous to rules discovered for "higher level" program 
statements.

(d) Potential parallelism is defined in this thesis as 
the parallelism embedded in the (writable) control memory 
(micro-) word organization. The last of the problems 
considered here is the analysis of potential parallelism 
with respect to (i) its maximization using as a basis, 
the assignment of micro-operations to clock-cycle phases; 
and (ii) its application in the determination of the 
smallest minimally-encoded control memory word. Previous 
studies of the so-called "control memory minimization 
problem" were concerned with read-only memories. These 
results are extended here to the case of writable control
memories.

CHAPTER I
INTRODUCTION 


1.1 Review 


This thesis presents the results of a study of 
parallelism in microprogramming systems. As such it is 
intended as a contribution, not only to the steadily
growing catalogue of microprogram optimization strategies, 
but also to our understanding of the nature of parallel 
processing systems. 

The design and implementation of microprogrammed
control units, in fact, represents one of the earliest
developments in parallel processing. For example, Wilkes'
original design, and many of the initial extensions of
this model (discussed by Husson [38]), fall within the
category of what we now refer to as "horizontal" micro- 
programming. In such systems several primitive operations 
are executed within a basic machine cycle. 

There were, however, practically no attempts to
analyse, or develop models of, parallelism at the micro-
programming level until the present decade. This can be
contrasted with the extensive analyses of multiprocessing
and other "higher level" concurrency phenomena which have
continuously emerged since the early 1960's [6,10,12,37,53].

The very recent surge of interest in optimization,
parallelism, and other "formal" aspects of microprogramming
stems from a number of reasons, the most significant of
these being the emergence of the writable control memory
as a technologically viable storage medium. A wide
variety of machines are now commercially available 
[75,78,79,80], which permit users to define their own 
architecture through "dynamic" microprogramming, and it 

is easy to realize how the availability of this technique 
has - in theory at least - enlarged the scope of micro- 
programming far beyond the original objectives established 
by Wilkes. 

As a result, extensive experimentation is currently 
in progress on such applications as the implementation of 
high-level language architectures [9,14,16,33,60,73], 
operating system "environments" [64,74], and "universal" 
host machines for emulation [24,49]. It seems likely 
that such applications will involve larger, and far more
complex, microprograms than are required for realizing
simple machine language instructions. The latter, of
course, has been the traditional role of microprogramming.

A second consequence of writable control stores is 
that microprogramming is being examined more as a 
programming activity. As a result, several high level 
languages have been proposed or implemented with the 
primary objective of enhancing the ease of writing micro-
programs [15,22,26,34,47,54,56]. Numerous authors have
pointed out, however, that the usefulness of these languages
will depend heavily on how efficient the object micro-
code is, as compared to conventionally produced microcode.
These observations have thus provided the general impetus 
to the analysis of microprograms with a view to optimi- 
zation, an area of research largely initiated by Kleir 
and Ramamoorthy [40]. 

The term "optimization" as used by these authors
refers essentially to strategies for deleting redundant
micro-operations within a sequence of such operations.
The analogy with program optimization is obvious, and in
fact, Kleir and Ramamoorthy adopted many of the ideas
originated by Allen [4,5] for machine code optimization.

From the viewpoint of effectiveness however, there 
is a small but vital distinction between program and 
microprogram optimizations. For, in the former case, 
elimination of a single instruction is "useful" in that
it reduces program execution time (by the amount required 
to fetch and execute the deleted instruction). The 
deletion of a single micro-operation on the other hand, 
may not necessarily reduce microprogram execution time, 
since the "useful" unit of activity in this case is the 
microinstruction, which may contain several micro-
operations. Deletion of a micro-operation thus becomes
useful only if the result of the deletion is the elimination
of a microinstruction also. This will certainly
happen in the case of vertical microprogramming systems,
but not necessarily so in horizontal schemes.

Thus in addition to techniques for optimizing 
microprograms in the above sense, strategies are required 
for compacting the microcode into as small a set of micro- 
instructions as possible; in other words, producing 
optimal or near-optimal horizontal microprograms. 
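
The distinction between deleting a micro-operation and eliminating a microinstruction can be illustrated with a small sketch. It assumes, purely for illustration, that a microprogram is a list of microinstructions and that each microinstruction is a set of micro-operation names; the function `delete_mo` and the names `m1`, `m2`, `m3` are hypothetical and do not come from the thesis.

```python
def delete_mo(microprogram, mo):
    """Remove one micro-operation; drop any microinstruction left empty."""
    trimmed = [block - {mo} for block in microprogram]
    return [block for block in trimmed if block]


# Vertical scheme (assumed): one micro-operation per microinstruction.
vertical = [{"m1"}, {"m2"}, {"m3"}]
# Horizontal scheme (assumed): micro-operations share a microinstruction.
horizontal = [{"m1", "m2"}, {"m3"}]

# Deleting m2 removes a whole microinstruction only in the vertical case.
print(len(vertical), "->", len(delete_mo(vertical, "m2")))
print(len(horizontal), "->", len(delete_mo(horizontal, "m2")))
```

In the vertical case the microinstruction count drops from 3 to 2, while in the horizontal case it stays at 2, so the deletion saves no microinstruction fetch.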

The class of techniques for achieving this objec- 
tive is loosely termed horizontal optimization. A major 
part of the present thesis is addressed to this problem, 
which can be stated more precisely as follows: 

Let A be an algorithm to be implemented as a
microprogram. Then A can be realized by a sequence of
micro-operations, say S, such that the sequential execution
of S produces the desired result. I shall term such a
micro-operation sequence a canonical microprogram. The
problem of horizontal optimization is to determine, for
a given canonical microprogram, (a) a partition of the
micro-operations contained in it such that the micro-
operations in each partition block can be executed in
parallel (in some well defined sense that will be specified
later), and the number of blocks for the given canonical
microprogram is minimum; and (b) an ordering of the par-
tition blocks such that the execution of the ordered set
of blocks produces the same result as would be produced by
the execution of the canonical microprogram. Fig. 1.1
schematizes this particular aspect of optimization.
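
The two requirements (a) and (b) can be made concrete with a small checker. This is only a sketch under stated assumptions: the pairwise predicate `parallel` stands in for the "well defined sense" of parallelism that the thesis specifies later, the rule that non-parallel pairs must keep their canonical order is a conservative simplification introduced here, and the checker does not test minimality of the number of blocks. All names are hypothetical.

```python
from itertools import combinations


def is_valid_compaction(canonical, blocks, parallel):
    """Check a candidate ordered partition of a canonical microprogram.

    canonical: list of distinct MO names in sequential order.
    blocks:    list of sets of MO names, in execution order.
    parallel:  function(a, b) -> bool; True if a and b may share a block.
    """
    placed = [mo for block in blocks for mo in block]
    if sorted(placed) != sorted(canonical):
        return False  # not a partition of the canonical microprogram
    position = {mo: k for k, block in enumerate(blocks) for mo in block}
    # combinations() preserves input order, so a precedes b canonically.
    for a, b in combinations(canonical, 2):
        if parallel(a, b):
            continue
        if position[a] >= position[b]:
            # Assumed conservative rule: a non-parallel pair may neither
            # share a block nor be reordered.
            return False
    return True


# Hypothetical example: only m1 and m2 may share a microinstruction.
allowed = {frozenset({"m1", "m2"})}
parallel = lambda a, b: frozenset({a, b}) in allowed
canonical = ["m1", "m2", "m3"]

print(is_valid_compaction(canonical, [{"m1", "m2"}, {"m3"}], parallel))
print(is_valid_compaction(canonical, [{"m1", "m3"}, {"m2"}], parallel))
```

The first candidate is accepted; the second is rejected because `m1` and `m3` are not parallel under the assumed predicate.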


[Fig. 1.1: General Scheme for Horizontal Optimization (canonical microprogram to microinstruction set)]

[Fig. 1.2: Sets of Micro-operations to be Placed in Each Word: An Example]

In order to preserve all the parallelism specified
in these blocks, the microinstruction word ("microword")
organization must allow each block to be placed in a
word of control memory; otherwise, a part of the parallelism
will be lost. For example, a given block may contain several
micro-operations. For maximum
efficiency, the control memory word must be so organized
as to allow these micro-operations to be specified in a
single word.

I shall use the term potential parallelism to 
denote the parallelism implicit in a given microword 
organization. 

In the design of read-only control memories (ROM's), 
potential parallelism is only of marginal interest. For,
in this case, the precise nature of the microprograms 
defining the machine's instruction set is known a priori, 
and the microword organization and timing behaviour can 
be so determined as to maximize the average actual 
parallelism per microinstruction. 

The situation for machines with writable control 
memories (WCM's) is however quite different: in this case, 
the nature of the user microprograms will be unknown to 
the designer. Thus, if a horizontal WCM word organization 
is to be used, clearly one of the desirable performance 


objectives is to enhance the microword potential parallelism
as far as possible.

An examination of the precise nature of potential
parallelism constitutes a second major focus of investigation
in this thesis.

The concept of potential parallelism is also useful 
in the context of the control memory minimization problem. 
Stated succinctly, this refers to the problem of minimizing
the word length of control memories. Earlier formal
investigations in this area [20,32,62] were focussed
principally on the minimization of read-only memories;
that is, given a priori knowledge that specific micro-
operations are to be executed in parallel, to construct a
minimally encoded ROM word of minimal length [20,58], such
that all the parallelism could be accommodated and there
were no conflicts otherwise. As a specific example,

Fig. 1.2 shows sets of parallel micro-operations. The 
problem is to determine a minimum-length, minimally- 
encoded word so as to permit each of the sets to be 
executed in parallel. 

The concept of potential parallelism is applied in
the present work to extend the results of Das et al. [19]
on ROM minimization to the problem of minimizing writable
control memory word lengths.


1.2 Defining Parallelism


In very general terms, two processes or "tasks"
$T_a$ and $T_b$ are said to be executable in parallel if, given
a task stream containing $T_a$ and $T_b$, the two tasks are
mutually independent according to some criteria; if the
latter are satisfied, the tasks can be executed "at the
same time".


The exact nature and complexity of the criteria
are determined by several factors. Specifically:
(a) The nature of the tasks;
(b) The structure of the task stream, e.g., whether
the stream contains conditional branches or not;
(c) The nature of the "processors" which are to
execute the tasks; and finally
(d) The quantum of time used to determine simultaneity
of execution.

Consider for example, the situation where we have
two identical processors sharing a main memory; we want
to know under what conditions two tasks $T_a$, $T_b$, originally
scheduled for sequential execution, can be initiated in
parallel. Necessary and sufficient conditions were first
obtained by Bernstein [10] and may be summarized by the
relation

$$(SC_a \cap SK_b = \phi) \land (SK_a \cap SC_b = \phi) \land (SK_a \cap SK_b = \phi) \qquad (1.1)$$

where $SC_a$, $SK_a$ ($SC_b$, $SK_b$) denote respectively, the sets
of memory elements used as the data source and data sink
by $T_a$ ($T_b$), and $\phi$ denotes the empty set. These are the
so-called data independency conditions.

Implicit in conditions (1.1) are the assumptions
that (i) there are available two (or more) processors
each of which can execute $T_a$ and $T_b$, and (ii) $T_a$ and $T_b$
are primitive tasks for the level of processing being
considered.
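
Conditions (1.1) translate directly into set operations. The following is a minimal sketch, assuming each task is described only by its source set and sink set of memory element names; the function name and the example tasks are hypothetical illustrations, not taken from Bernstein [10] or from the thesis.

```python
def bernstein_independent(sc_a, sk_a, sc_b, sk_b):
    """Return True if the three intersections in (1.1) are all empty."""
    return (not (sc_a & sk_b)      # T_a reads nothing that T_b writes
            and not (sk_a & sc_b)  # T_a writes nothing that T_b reads
            and not (sk_a & sk_b)) # the two tasks write disjoint elements


# Assumed tasks:  T_a: x = y + z     T_b: w = y * 2     T_c: y = x + 1
ta = ({"y", "z"}, {"x"})
tb = ({"y"}, {"w"})
tc = ({"x"}, {"y"})

print(bernstein_independent(*ta, *tb))  # shared source y only
print(bernstein_independent(*ta, *tc))  # T_c writes y, which T_a reads
```

Sharing a source (both `T_a` and `T_b` read `y`) does not violate (1.1); only a source/sink or sink/sink overlap does.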


If $T_a$ and $T_b$ are non-primitive tasks, i.e., if
they can be further decomposed into smaller but still
meaningful tasks at the level of processing being con-
sidered, then (1.1) will not define necessary conditions.
For instance, in a multiprogramming environment, there
may be several logically distinct processes executing
concurrently, yet operating on a shared variable. As
far as the processes are concerned, (1.1) is violated
(because of the shared variable). However, by placing
operations on the shared variable within critical regions,
these particular operations are made mutually exclusive
over time [12,13]. Yet the overall processes satisfy the
intuitive notion of parallelism.

As will be seen later, parallelism at the micro-
programming level involves both simultaneous and non-
simultaneous processes. This is a consequence of the
timing characteristics of the control unit, and the fact
that by convention, the meaningful unit of activity is
the microword. Stated simply, micro-parallelism is the
phenomenon of potential or actual activation of multiple
micro-operations from a single microword. The present
dissertation is then concerned with the development,
refinement, and application of this simple notion.


1.3 Organization of the Thesis


Chapter II surveys some of the earlier researches 
on micro-parallelism. Since much of this work was 
concerned with the automatic detection of parallel micro- 
operations in branch-free microcode, the survey is largely 
dominated by this topic. To provide a framework for the 
discussion, a model of the architecture of a "micro- 
programmable machine" is proposed in the earlier part of 
this chapter. 

Chapter III develops the notion of potential 
parallelism; procedures for enhancing the potential 
parallelism in microwords, and minimizing their word 
lengths are presented. 

In Chapter IV, a new, general, optimizing algorithm
for detecting parallelism in "straight line" microprograms
is presented. The proposed algorithm compares rather
favourably with earlier efforts; for while the latter
include at least one algorithm that is quite general
in its applicability - that due to Jackson and Dasgupta
[39] - it does not attempt to optimize. On the other
hand, the Yau-Schoewe-Tsuchiya algorithm [77] produces
optimal output but is limited in application to monophase
microprograms.

Extending the detection of parallelism to branch-
containing microprograms becomes important when efficiency
of (microprogram) execution is of prime consideration.
The problem of what I call "global" parallelism (in con-
trast to the "local" parallelism in straight line micro-
programs) is analysed using graph-theoretic concepts in
Chapter V, and a system of algorithms is developed for
detecting both local and global parallelism in "loop-free"
microprograms.

It must be remembered that the whole idea of hori- 
zontal optimization stems from the premise that micro- 
programs will be written in sequential form in some high 
level language, and that optimization will be performed 
by the compiler. I feel however, that the microprogrammer 
should in fact, be given a choice as to whether optimiza- 
tion is to be done mechanically or manually; this seems 
rather important given the fact that there are limits to 
the extent of mechanical optimization that is feasible. 
Thus highly used segments of microcode can be optimized 
by the programmer, leaving less frequently used segments 
to the compiler. 

The above considerations lead to the interesting
problem of developing language constructs for horizontal
microprogramming.

This problem forms the subject matter of Chapter
VI. I will argue that it is not sufficient merely
for a particular set of constructs to express micro-
parallelism; they should also allow microprogram veri-
fication rules to be established in the tradition of
similar rules discovered by Hoare for higher level pro-
gramming statements [35,36]. A specific set of constructs
is proposed in Chapter VI, and their features discussed.
Verification rules for these constructs are also determined.

CHAPTER II


MICROPARALLELISM: A FRAMEWORK AND SURVEY 


2.1 Architecture of the Microprogrammable Processor 


A microprogrammable processor (MP) is simply the 
processor "as seen by" the (microprogrammed or micro- 
programmable) control unit. Given below, is a model of 
the MP which can serve as a framework for much of the 
discussion that follows. 

An MP is characterized by (i) a set of resources
$\mathbf{R} = \mathbf{M} \cup \mathbf{O} \cup \mathbf{P}$ where $\mathbf{M}$ is a set of memory elements, $\mathbf{O}$ a set
of operational units, and $\mathbf{P}$ a set of data-paths; and (ii)
a set of feasible events $\mathbf{E}$. Examples of elements from
the sets of (i) are respectively, registers, the
arithmetic-logic unit (ALU) and a path between a register
output and an ALU input.


An event $E \in \mathbf{E}$ can be one or a combination of the
following:

(a) a simple flow of information along a path $P \in \mathbf{P}$;

(b) a registration of information in a memory element
$M \in \mathbf{M}$; or

(c) the activation of a unit $O \in \mathbf{O}$, thereby causing $O$ to
perform a computation. In such a case it is
assumed that $O$ simply extracts the argument on its
ports, computes the desired function, and presents
the result on its output port.

For example, let $M$ be a memory element, $O$ a shift
unit, and $P$ a path from $M$ to $O$'s input. Then two possible
events may be described symbolically by:

OINPUT ← M;                        (2.1)

OOUTPUT ← shiftleft (OINPUT);      (2.2)

The first event causes the contents of $M$ to be trans-
ferred (along $P$) to $O$'s input; the second event causes
a "shift left" operation to be performed by $O$ on its
input argument.

In the MP, an event is caused by a control signal
originating in a read-only memory (ROM) or a writable
control memory (WCM); each such signal is termed a micro-
operation (MO). Let the set of all MO's be denoted by
$\mu^*$. It is assumed that $\mu^*$ is a finite set, and that
there exists a one to one correspondence between elements
of $\mu^*$ and elements of $\mathbf{E}$. Because of this correspondence,
the term "micro-operation" can be used without ambiguity,
to denote both the control signal and the event invoked
by the signal.

The control memory is considered to be a linear
sequence of words (microwords). Each microword in turn
is composed of a set of subwords or fields. The precise
organization of the fields is not of importance for the
present. For our purposes, only those fields are of
interest which are directly responsible for the execution
of the micro-operations. More precisely, each such field
$F_i$ is associated with, or is an encoding of, a specific
subset of MO's from $\mu^*$:

$$F_i = \{\mu_{i1}, \mu_{i2}, \ldots, \mu_{ik}\} \qquad (2.3)$$

such that at any given time, one and only one of these
MO's can be executed. Thus MO's encoded in the same field
are mutually exclusive over time.
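
The mutual exclusion imposed by field encoding can be sketched as a simple check of whether a set of MO's fits into one microword. The sketch assumes, for illustration only, that every MO is encoded in exactly one field; the dictionary layout, the function `fits_microword`, and the field and MO names are hypothetical.

```python
def fits_microword(fields, microinstruction):
    """Return True if no two MO's of the microinstruction share a field.

    fields: dict mapping a field name to the set of MO's it encodes.
    """
    used = set()
    for mo in microinstruction:
        owners = [name for name, mos in fields.items() if mo in mos]
        if len(owners) != 1:
            return False  # assumption violated: MO must be in one field
        if owners[0] in used:
            return False  # field already selects another MO
        used.add(owners[0])
    return True


# Assumed microword with two fields.
fields = {"F1": {"m1", "m2"}, "F2": {"m3"}}

print(fits_microword(fields, {"m1", "m3"}))  # different fields
print(fits_microword(fields, {"m1", "m2"}))  # both encoded in F1
```

Under this assumed layout `m1` and `m2` can never be issued from the same microword, whatever their data dependencies.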

The execution of MO's is controlled by a machine
cycle C, characterized by a set of phases $I_1, I_2, \ldots, I_k$
(Fig. 2.1) such that each MO is executed in a specific
phase or sequence of phases of C; or (less frequently),
the MO's execution time spans several cycles. The present
model assumes that all MO's are synchronous. The phases
of the machine cycle may overlap, as shown in Fig. 2.1.

A microinstruction I is a (microprogrammer) speci-
fied set of MO's executed (or to be executed) from a
single microword. The relation between microword and
microinstruction is analogous to that between the class of
instructions of a particular format, and an instruction
of that format, at the machine instruction level. One
must note however that in the case of ROM's, the micro-
word has no separate identity of its own since each micro-
word in the ROM is, in effect, a distinct microinstruction.


As stated above, MO's are assumed to be activated
in one or more phases of the machine cycle C, or over
several such cycles. A further aspect of timing is the
assumption that the execution of a microinstruction
requires one or more machine cycles; that is, if $t_c$
denotes the duration of a machine cycle, then a micro-
instruction $I$ is executed in time $N_I t_c$ for some integer
$N_I \geq 1$ depending on $I$. In the usual case $N_I = 1$, i.e.,
the microinstruction cycle time is the same as the machine
cycle. The most common situation under which $N_I > 1$ is when
$I$ includes a main memory read or write operation.
Fig. 2.2 provides an instance of a machine cycle containing
a number of non-overlapping phases; the class of operations
associated with each phase is also indicated.

[Fig. 2.1: A Polyphase Machine Cycle]

A microprogram is any sequence of microinstructions
the execution of which causes a machine instruction from
main memory to be partially or completely interpreted.

This completes the description of the MP. Further
elaboration or refinements of this architecture will
follow in relevant sections of the thesis. I shall com-
plete this section however, by introducing a useful
notation for denoting MO's. The particular representation
used here has evolved from notations proposed originally
by Kleir and Ramamoorthy [40], developed further by Sitton
[63] and later modified by Jackson and Dasgupta [39].


A micro-operation will be denoted by the 5-tuple

$$\mu = \langle OP, SC, SK, U, V \rangle \qquad (2.4)$$

where

'OP' designates a primitive operation, e.g., ADD, SHIFT,
GATE;

'SC', 'SK' denote the data source and sink sets respec-
tively, for OP;

'U' denotes the set of operational units and/or paths
required to execute $\mu$. U will be simply called $\mu$'s
unit; and

'V' is a symbol representing the phase(s) of the machine
cycle, or the number of such cycles in which $\mu$ is
executed. V is called the time-validity of $\mu$.

If the MO simply involves information flow along a
path, then the U field can be left unspecified as long as
the path is implicitly defined by the SC and SK fields.
Finally, given the above representation, $\mu$'s resources
are given by $R_\mu = SC \cup SK \cup U$. Some examples of MO's using
the above representation are:

$\mu_1$ = <GATE, {M1}, {ALU-LEFT}, , I>
$\mu_2$ = <NOT, {ALU-LEFT}, {ALU-OUT}, {ALU}, I>
                                                   (2.5)
$\mu_3$ = <GATE, {ALU-OUT}, {M2}, , I>
$\mu_4$ = <…, {M2}, {M2}, {SHFTR}, I>
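
The 5-tuple (2.4) maps naturally onto a small record type. The sketch below is one possible encoding, not the thesis's own: the class name `MicroOp`, the use of frozensets, and the phase labels `"phase1"` and `"phase2"` are assumptions made for illustration (the phase subscripts of the examples above are not legible in the source).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    op: str                       # primitive operation, e.g. "GATE"
    sc: frozenset                 # data source set SC
    sk: frozenset                 # data sink set SK
    u: frozenset = frozenset()    # unit U; may be left unspecified
    v: str = ""                   # time-validity V (assumed label)

    def resources(self):
        """R = SC u SK u U, the resources used by this MO."""
        return self.sc | self.sk | self.u


# Two MO's modelled on the first two examples of (2.5).
gate = MicroOp("GATE", frozenset({"M1"}), frozenset({"ALU-LEFT"}),
               v="phase1")
neg = MicroOp("NOT", frozenset({"ALU-LEFT"}), frozenset({"ALU-OUT"}),
              frozenset({"ALU"}), "phase2")

print(sorted(gate.resources()))
print(sorted(neg.resources()))
```

Leaving `u` empty for the gating MO mirrors the remark that the U field can be omitted when the path is implied by SC and SK.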


2.2 Analysis of Straight Line Microprograms 
At the time of writing, most of the work on detect-
ing parallel micro-operations was concerned with straight
line microprograms (SLM's). An SLM is simply a sequence
of micro-operations

$$S = \mu_1 \mu_2 \ldots \mu_n$$

with a single entry point ($\mu_1$) and a single exit point
($\mu_n$). The term is thus synonymous with "basic block"
as used in the theory of program optimization [2,5].
In this section I shall review some of the known results
on SLM's.

As stated in Chapter I, microprogram optimization
techniques were pioneered by Kleir and Ramamoorthy [40].
The same authors attempted to formulate precisely the
conditions necessary and sufficient for microparallelism:
two micro-operations $\mu_i, \mu_j$ could be placed in the same
microinstruction provided that

$$(SC_i \cap SK_j = \phi) \land (SK_i \cap SC_j = \phi) \land (SK_i \cap SK_j = \phi) \land (U_i \cap U_j = \phi) \qquad (2.6)$$

Note that (2.6) is stronger than Bernstein's condition
(1.1). As pointed out earlier, Bernstein assumed the
availability, at all times, of at least two processors
capable of executing both tasks. Such an assumption is
hardly valid for microprograms since an MO is executed
by a specialized and (usually) unique unit. Hence the
condition $U_i \cap U_j = \phi$ must be explicitly stated.
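
The effect of the extra unit condition in (2.6) can be seen in a short sketch. It assumes each MO is given as a dictionary with the keys `sc`, `sk` and `u`; this representation, the function name `kr_parallel`, and the two example MO's are hypothetical and serve only to illustrate the condition.

```python
def kr_parallel(a, b):
    """Return True if MO's a and b satisfy all four terms of (2.6)."""
    return (not (a["sc"] & b["sk"])
            and not (a["sk"] & b["sc"])
            and not (a["sk"] & b["sk"])
            and not (a["u"] & b["u"]))  # unit condition absent from (1.1)


# Two data-independent additions (assumed example).
add1 = {"sc": {"R1", "R2"}, "sk": {"R3"}, "u": {"ADDER"}}
add2 = {"sc": {"R4", "R5"}, "sk": {"R6"}, "u": {"ADDER"}}
shift = {"sc": {"R4"}, "sk": {"R6"}, "u": {"SHIFTER"}}

print(kr_parallel(add1, add2))   # same unit required by both
print(kr_parallel(add1, shift))  # distinct units, disjoint data
```

The pair `add1`, `add2` satisfies the three data terms yet fails (2.6) because both need the single adder, which is exactly the case Bernstein's conditions do not cover.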


Kleir and Ramamoorthy also pointed out that
Tomasulo's algorithm for multiple hardware units [70]
could be adapted to the analysis of microprograms. The
basic idea is to examine the data flow and determine
which outputs can be fanned out to memory elements in
parallel, thereby eliminating temporary storage. For
instance, consider the micro-operation sequence in Fig.
2.3 (1).

Since a gating operation simply transfers data
between resources, the source and sink registers of a
gating operation hold identical data after it has been
executed. If however, the result of the operation
R1 + R2 could be concurrently fanned out to more than
one sink unit, the intermediate gating micro-operations
would become redundant, hence deletable. The result-
ing sequence is shown in Fig. 2.4.

[Fig. 2.3: Input to Tomasulo's Algorithm]

[Fig. 2.4: Output of Tomasulo's Algorithm]

(1) The notation used here is based on Tsuchiya's SIMPL
language [54,71]. Whenever convenient, I shall use
this notation in conjunction with, or as an alter-
native to, (2.4).

Actually, since this particular strategy was pro-
posed as a means of identifying redundant (hence delet-
able) MO's, it belongs more properly to "vertical"
optimization. Its interest in the present context lies
in its utilization of knowledge of the potential paralle-
lism in the machine data flow.

Kleir and Ramamoorthy's conditions (2.6) are, how-
ever, not sufficiently general to include parallelism in
polyphase systems. A more complete analysis of the
problem, taking polyphase schemes into account, was subse-
quently reported by Jackson and Dasgupta [39]. The main
result of this analysis uses the following notations and
concepts:

For a pair of MO's $\mu_i, \mu_j$, the relation $\mu_i\ \alpha\ \mu_j$ is
said to hold if $(\ldots = \phi) \land (\ldots = \phi)$. If in
addition, the condition $(\ldots = \phi)$ is satisfied, then
$\mu_i\ \beta\ \mu_j$. Intuitively then, $\mu_i\ \beta\ \mu_j$ implies that $\mu_i, \mu_j$ are
data independent.

For a pair of MO's $\mu_i, \mu_j$, if the time validities
$V_i, V_j$ are identical, or they overlap, then this is de-
noted by $V_i \cap V_j \neq \phi$. If $V_i$ precedes $V_j$ with respect to
the reference machine cycle, then $V_i < V_j$ (or $V_j > V_i$).
Furthermore, $V_i < V_j$ or $V_j < V_i$ implies $V_i \cap V_j = \phi$. Finally,
if an MO $\mu_i$ precedes an MO $\mu_j$ in an SLM, then $\mu_i < \mu_j$.

Definition 1:

A pair of MO's $\mu_i, \mu_j$ are disjoint, denoted
$\mu_i\ \delta\ \mu_j$, if any of the follow-
ing conditions are satisfied:
(i) $(V_i \cap V_j \neq \phi) \land (\mu_i\ \beta\ \mu_j) \land (U_i \cap U_j = \phi)$;
(ii) $(V_j < V_i) \land (\ldots)$;
(iii) $(V_i < V_j) \land (\mu_i\ \alpha\ \mu_j)$.

Statement (i) of this definition simply lists the condi-
tions for simultaneous execution of two MO's. That is,
if the time validity fields intersect, and the MO's are
in the same microinstruction, then there must be neither
unit conflicts nor data dependencies between the MO's.
The other two statements merely relax the conditions on
hardware resources in order that the MO's be parallel.

Definition 2:

A pair of MO's $\mu_i, \mu_j$ in an SLM satisfying $\mu_i < \mu_j$
are conditionally disjoint provided that
$(V_i < V_j) \land \lnot(\mu_i\ \alpha\ \mu_j)$.

This definition in fact states the condition under which
a pair of MO's may be placed in the same microinstruction
even though their resources are in conflict. For example
consider the following:

$\mu_1$ = <GATE, {A}, {B}, -, $V_1$>

$\mu_2$ = <ADD, {B, …}, {…}, {ADDER}, $V_2$>

Notice here that $SC_2 \cap SK_1 \neq \phi$; however, if $V_1 < V_2$ then it
is immaterial that the sources and sinks intersect since
$\mu_1$ will be activated (and terminated) before $\mu_2$ begins
execution even when they are placed in the same micro-
instruction.

Based on these definitions, the conditions
necessary and sufficient for pairwise parallelism
between MO's were obtained in [39] as follows: For
a pair of MO's $\mu_i, \mu_j$ in an SLM satisfying $\mu_i < \mu_j$, $\mu_i$ and
$\mu_j$ are parallel iff they are either disjoint or
conditionally disjoint:

$$(\mu_i\ \delta\ \mu_j) \lor (\mu_i, \mu_j \text{ conditionally disjoint}) \qquad (2.7)$$

For a proof the reader is referred to [39]. Considering
the SLM specified below, it can be seen for example, that
…

$\mu_1$ = <ADD, {…}, {…}, {ADDER}, I>
$\mu_2$ = <SHFTR, {…}, {…}, {SHIFTER}, I>
$\mu_3$ = <GATE, {…}, {…}, -, I>                 (2.8)
$\mu_4$ = <…, {…}, {…}, -, I>
$\mu_5$ = <GATE, {…}, {…}, -, I>

2.3 Algorithms for Identifying Parallelism in SLM's


Procedures for identifying parallelism in SLM's
have been developed in recent years by several authors,
notably Ramamoorthy and Tsuchiya [54], Jackson and
Dasgupta [39], Tsuchiya and Gonzales, and Yau et al.
[77]. These are discussed below.

In assessing these algorithms, one should keep in 
mind the following three principal measures of performance: 
(i) the generality of the algorithm;
(ii) its optimality; and
(iii) the complexity of the algorithm.

By generality, I mean the extent to which the 
algorithm is applicable to a broad class of machine 
structures, timing attributes and microinstruction forms. 
This is necessarily a qualitative measure. For instance 
if an algorithm ignores the machine's timing characteris- 
tics, then clearly it is applicable to monophase systems 
only. 

By optimality, I am really referring to the algorithm's
optimizing capability, i.e., how close the output
set of microinstructions is to some "minimum".


Given a (canonical) microprogram

$$S = \mu_1 \mu_2 \ldots \mu_n$$

and two optimizing algorithms $A_1$, $A_2$, $A_1$ will be said to
be "more optimal" than $A_2$ if $A_1$ when applied to $S$ produces




$N_1(S)$ microinstructions, $A_2$ when applied to $S$ produces
$N_2(S)$ microinstructions and $N_1(S) < N_2(S)$.

Now, given S, equivalent microprograms may be 
produced by reordering (permuting) the MO's of S. By
"equivalent" is meant that for all valid inputs to the
microprograms, identical outputs are produced. Hence
$A_1$ achieves greater optimality (with respect to $A_2$) if
$A_1$ (implicitly or explicitly) transforms $S$ into some
equivalent microprogram $S'$, and $A_2$ transforms $S$ into some
equivalent microprogram $S''$ such that $N_1(S') < N_2(S'')$.

The reader should note that I am considering 
reordering only, as a means of transformation; another 
source of transformation is to search for a sequence of
MO's (say S*) from all possible sequences of MO's such
that S* is computationally equivalent to S. Such trans- 
formations will be ignored here. 

Finally, by complexity, I refer to the computa- 
tional complexity of the algorithm. An appropriate 
measure of complexity in the present context is the 
number of pairwise comparisons of MO's performed by the 


algorithm, as a function of the SLM's length.


2.3.1 The Ramamoorthy-Tsuchiya Algorithm [54] 


This algorithm (henceforth denoted as the RT 
algorithm) is composed of four phases as follows: 
Phase 1: The input SLM is scanned, and the data depen- 


dencies between MO's established. Using this information, 
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a dependency graph is constructed by the method developed 
by Ramamoorthy and Gonzales in [31,53]. Thus, for the 
sequence shown in Fig. 2.5, the dependency graph is as 


indicated in Fig. 2.6.


    $\mu_{01}$ : $ACC \leftarrow \ldots\ M3$;
    $\mu_{02}$ : $R4 \leftarrow R2 \wedge M3$;
    $\mu_{03}$ : $ACC \leftarrow R4 + ACC$;
    $\mu_{04}$ : ... $ACC$ ...
    $\mu_{05}$ : $R1 \leftarrow R1$ ...
    $\mu_{06}$ : $R2 \leftarrow R2\ \ldots\ M4$;
    $\mu_{07}$ : $ACC$ ...

Fig. 2.5  RT Algorithm: Input Example


Note that the graph is not based on the data 
independency relation "$\delta$" alone. For instance
$\neg(\mu_{02} \,\delta\, \mu_{06})$; yet according to the dependency graph
they are independent. This is because, in deriving data
dependencies, the algorithm assumes a specific timing
scheme (Fig. 2.7); thus, the contents of R2 will have
been gated out (in $\mu_{02}$) before a new value is gated into
R2 (in $\mu_{06}$).
Fig. 2.6  Dependency Graph for the RT Algorithm: An Example

Phase 2: The earliest and latest possible execution
times for each MO are determined using the method
described in [53]. For the particular example of Fig. 2.6,
the earliest and latest times are shown in Figs. 2.8 and
2.9 respectively; here, $t_1$, $t_2$ and $t_3$ designate three
successive time "frames".

Fig. 2.8  Earliest Times          Fig. 2.9  Latest Times


Phase 3: Critical MO's are defined as those MO's which 
occupy the same time frames in both the earliest-time 
and latest-time tables. In this phase, critical MO's are 


identified (asterisked MO's in Fig. 2.9). 
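
Phases 2 and 3 can be illustrated by the following minimal Python sketch. It is not the method of [53]; it simply assumes that each MO occupies one time frame and that the data dependencies are given as index pairs. The names `time_frames` and `critical_ops`, and the five-MO example in the usage note, are hypothetical and do not correspond to Fig. 2.5.

```python
from collections import defaultdict


def time_frames(n_ops, edges):
    """Return (earliest, latest) 1-based time frames for MO's 0..n_ops-1.

    edges is a list of (i, j) pairs meaning MO j is data dependent on MO i.
    Assumption: i < j for every pair, as in a straight-line microprogram,
    so index order is already a valid topological order.
    """
    preds = defaultdict(list)
    succs = defaultdict(list)
    for i, j in edges:
        preds[j].append(i)
        succs[i].append(j)

    # Forward pass: an MO runs one frame after its last predecessor.
    earliest = {}
    for v in range(n_ops):
        earliest[v] = 1 + max((earliest[p] for p in preds[v]), default=0)

    # Backward pass: an MO must run one frame before its first successor.
    total = max(earliest.values(), default=0)
    latest = {}
    for v in reversed(range(n_ops)):
        latest[v] = min((latest[s] for s in succs[v]), default=total + 1) - 1
    return earliest, latest


def critical_ops(earliest, latest):
    """MO's occupying the same frame in both tables (Phase 3)."""
    return sorted(v for v in earliest if earliest[v] == latest[v])
```

For instance, with five MO's and the dependencies `[(0, 2), (1, 2), (2, 4), (3, 4)]`, MO 3 may execute in frame 1 or frame 2 and is therefore the only non-critical MO.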


Phase 4: Each set of concurrently executable MO's is
assigned to a single time frame, the latter designating
a single clock cycle. Critical MO's in the same time
frame are first compared for unit conflicts; if conflicts
exist they are ordered so as to resolve these conflicts.
For instance $\mu_{02}$, $\mu_{06}$ use the logic unit, hence they must
be placed in different time frames in spite of their data
independency. The critical MO's alone give rise to the
time frames shown in Fig. 2.10. The non-critical MO's
are then placed in the earliest possible time frame that
does not generate any resource conflicts.

Fig. 2.10  Time-frames for Critical MO's          Fig. 2.11  Final Output from the RT Algorithm

The final output produced by Phase 4 is shown in Fig. 2.11.


The RT algorithm is essentially an adaptation of 
Ramamoorthy and Gonzales' method for detecting parallel 
tasks in a multiprocessor system [53]. The reader will
note that data independencies and unit conflicts are
determined in separate phases; this will reduce computational
time to the extent that unit conflicts between
critical MO's need to be resolved only if the critical
MO's are in the same time frame (are data-independent).
However, conflict analysis involving non-critical MO's
may not be so economical; e.g., $\mu_{06}$ though belonging to
$t_1$ in the earliest-time table (Fig. 2.8), is eventually
compared with at least one MO from each time frame.

From the viewpoint of generality, the RT algorithm 
is limited to the extent that a specific assumption is 
made about the timing constraints (Fig. 2.7). While this
assumption is valid for certain machines, far more complex
polyphase timing schemes may also exist. The applicability
of the algorithm to such systems is not at all
evident.

The microcode produced by the RT algorithm is non- 
optimal. However, I shall describe below an extension of 
the method which attempts to produce optimal output. 
Finally, a worst-case analysis of the algorithm indicates 
that it requires $O(n^2)$ comparisons, n being the length of
an input SLM. The critical phase here is Phase 2, where
all the MO's may have to be compared on a pairwise basis.


2.3.2 The Jackson-Dasgupta (JD) Algorithm [39] 


This algorithm is based on the results discussed

(ii) there is a labelled edge from $\mu_i$ to $\mu_j$ if $\mu_i \,\gamma\, \mu_j$.

In other words, if an unlabelled edge from $\mu_i$ to
$\mu_j$ exists, then $\mu_i$ must precede $\mu_j$; if a labelled edge
exists, $\mu_i$ and $\mu_j$ are conditionally disjoint (Def. 2.2).
As an example, consider the SLM of Fig. 2.12 below.
Assuming $\Pi_1 < \Pi_2$, the conflict graph will be as indicated
in Fig. 2.13. The labelled edges are indicated by 1's.
    $\mu_1 = \langle$ GATE, ..., ..., -,   ... $\rangle$
    $\mu_2 = \langle$ ADD,  ..., ..., ALU, ... $\rangle$
    $\mu_3 = \langle$ GATE, ..., ..., -,   ... $\rangle$
    $\mu_4 = \langle$ OR,   ..., ..., ALU, ... $\rangle$
    $\mu_5 = \langle$ GATE, ..., ..., -,   ... $\rangle$

Fig. 2.12  Input SLM to the JD Algorithm


Phase 2: From the conflict graph, sets of parallel MO's 
are extracted iteratively as follows: 
(i) If $V = \emptyset$ then stop else
construct a set I of vertices from V such that
the indegree of each vertex in I is zero;
(ii) While $V - I$ contains a vertex $\mu_j$ satisfying
(a) all edges terminating at $\mu_j$ are labelled,
and
(b) all edges terminating at $\mu_j$ originate from
I,
then $I \leftarrow I \cup \{\mu_j\}$;
(iii) Output I as a set of parallel MO's;
(iv) Form a subgraph using the vertices $V - I$; that is,
$V \leftarrow V - I$, with the edge set restricted to the remaining vertices;
(v) Go to (i).
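
The iteration above can be sketched in Python as follows. This is an illustrative rendering under stated assumptions, not the implementation of [39] or [21]: the conflict graph is assumed to be acyclic and is passed as a dictionary mapping an edge `(i, j)` to `True` when the edge is labelled; the function name `extract_parallel_sets` is hypothetical.

```python
def extract_parallel_sets(vertices, edges):
    """Extract sets of parallel MO's from a conflict graph (steps (i)-(v)).

    vertices: MO identifiers in SLM order.
    edges: dict mapping (i, j) -> True for a labelled edge, False otherwise.
    Assumption: the graph is acyclic, so step (i) always finds a vertex.
    """
    remaining = list(vertices)
    live = dict(edges)
    output = []
    while remaining:  # step (i): stop when V is empty
        incoming = {
            v: [(i, lab) for (i, j), lab in live.items() if j == v]
            for v in remaining
        }
        group = [v for v in remaining if not incoming[v]]  # indegree zero
        if not group:
            raise ValueError("conflict graph contains a cycle")

        # Step (ii): add vertices all of whose incoming edges are
        # labelled and originate from the current set I.
        changed = True
        while changed:
            changed = False
            for v in remaining:
                if v in group:
                    continue
                if all(lab and i in group for i, lab in incoming[v]):
                    group.append(v)
                    changed = True

        output.append(group)  # step (iii)

        # Step (iv): keep only the subgraph on V - I.
        remaining = [v for v in remaining if v not in group]
        live = {
            (i, j): lab
            for (i, j), lab in live.items()
            if i in remaining and j in remaining
        }
    return output  # step (v) is the enclosing loop
```

On a hypothetical four-vertex graph with an unlabelled edge 1 to 2, a labelled edge 1 to 3 and an unlabelled edge 3 to 4, the sketch places MO's 1 and 3 together and MO's 2 and 4 together.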


Intuitively, a vertex $\mu_j$ of 0 indegree implies
that all MO's that must precede it have been placed in
an earlier microinstruction. Thus MO's selected in step
[i] of each iteration are pairwise parallel by virtue
of the $\delta$ relation. MO's selected in step [ii] are conditionally
disjoint to some MO in I and have no conflicts
with any other MO in the graph; hence, they can also be
placed in I by virtue of the $\gamma$ relation. Since the graph
is reduced after each iteration, the algorithm finally
terminates when V becomes empty. For the example of
Fig. 2.12, the output obtained is:
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The JD algorithm is more general in its applica- 
bility than the RT algorithm since the only assumption 
made about the underlying machine structure is that MO's 
be representable unambiguously as 5-tuples (2.4). Note 
that timing schemes containing any number of phases are 
permissible, and that phases may even overlap. 

Like the RT algorithm the JD algorithm is of complexity
$O(n^2)$ (where n = length of the SLM), since $O(n^2)$ comparisons
between MO pairs are required to construct the conflict
graph. Given the conflict graph however, Phase 2 of the
algorithm can be rather efficiently implemented [21], by
following the ideas proposed by Knuth for his topological
sorting algorithm [42]. The timing of the algorithm is
given by $K_1 N + K_2 M$ where N is the number of edges and
M, the number of vertices in the conflict graph, and $K_1$,
$K_2$ are constants.


Like the RT algorithm, the JD algorithm does not 


attempt to optimize the microcode. 


2.3.3 The Tsuchiya-Gonzales (TG) Algorithm [72]


This is a refinement of the RT algorithm with the 
objective of producing where possible, more optimal code 
than is produced by the RT method. 

As in the latter, the SLM is partitioned to indi- 
cate the earliest and latest execution times. However 


within each time frame, MO's are further partitioned by
their resource types. The information required for this
step is obtained from a resource usage matrix R whose
rows correspond to MO's and columns to resources, and

$$r_{ij} = 1 \text{ if } \mu_i \text{ uses resource } j; \quad r_{ij} = 0 \text{ otherwise.}$$
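
A resource usage matrix of this kind, and the partitioning of a time frame by resource type, might be sketched in Python as below. The function names and the three-MO example in the usage note are hypothetical and are not drawn from [72] or from Fig. 2.17.

```python
def usage_matrix(ops, resources):
    """Build R as a dict of rows: R[mo][k] == 1 iff mo uses resources[k].

    ops maps an MO name to the set of resources it uses (an assumed
    input format chosen for this sketch).
    """
    return {mo: [1 if r in used else 0 for r in resources]
            for mo, used in ops.items()}


def in_conflict(R, a, b):
    """Two MO's conflict when some column holds a 1 in both rows."""
    return any(x and y for x, y in zip(R[a], R[b]))


def partition_by_resource(frame, R, resources):
    """Group the MO's of one time frame by the resources they require."""
    groups = {}
    for k, r in enumerate(resources):
        users = [mo for mo in frame if R[mo][k]]
        if users:
            groups[r] = users
    return groups
```

With MO's `m1` using A, `m2` using B and `m3` using both, `m1` and `m2` are conflict free while `m3` conflicts with each of them, and the frame partitions into the groups for A and B accordingly.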


Consider the dependency graph of Fig. 2.15: its
earliest (E) and latest (L) time partitions are shown as
Fig. 2.16, while the partitioning of L according to
resource usage is indicated in Fig. 2.17.

Thus, $(\mu_i, \mu_j, \mu_k)_A$ simply indicates that the
MO's $\mu_i$, $\mu_j$, $\mu_k$ all require resource A.

The algorithm is best understood by applying it 
to an example. Consider for instance, the example represented
by Fig. 2.15.


Step [1]: $L_1$ is examined and is found to contain only
one MO. The corresponding partition in E, viz., $E_1$ is
then scanned, but since there are no other MO's in $E_1$,
the microinstruction $I_1 = \{\mu_1\}$ is constructed.

$E = \{E_1, E_2, E_3, E_4, E_5, E_6\}$

$E_1 = (\mu_1)$;                          $E_4 = (\mu_9, \mu_{10}, \mu_{11})$;
$E_2 = (\mu_2, \mu_3, \mu_4)$;            $E_5 = (\mu_{12}, \mu_{13})$;
$E_3 = (\mu_5, \mu_6, \mu_7, \mu_8)$;     $E_6 = (\mu_{14})$.

Fig. 2.16(a)  Earliest time partition for the dependency graph of Fig. 2.15

$L_1 = (\mu_1)$;                          $L_4 = (\mu_6, \mu_7, \mu_9, \mu_{10}, \mu_{11})$;
$L_2 = (\mu_2, \mu_4)$;                   $L_5 = (\mu_{12}, \mu_{13})$;
$L_3 = (\mu_3, \mu_5, \mu_8)$;            $L_6 = (\mu_{14})$.

Fig. 2.16(b)  Latest time partition for the dependency graph of Fig. 2.15

Fig. 2.17  Partitioning of L


Step [2]: Similarly, the two MO's in $L_2$ are conflict
free (from Fig. 2.17), hence they are both placed in $I_2$.
$E_2$ is scanned; it contains $\mu_3$ which conflicts with $\mu_2$
since they both use resource B. $\mu_3$ is thus tentatively
placed in the next level $L_3$ (in this case $L_3$ already
contains $\mu_3$).


Step [3]: $L_3$ is scanned. MO's $\mu_5$ and $\mu_8$ are in conflict,
hence execution of one of these has to be delayed. Before
this choice is made, $E_3$ is scanned for some MO which is
conflict free with either $\mu_5$ or $\mu_8$. The possible candidates
are $\mu_6$ and $\mu_7$ and since $\mu_7$ conflicts with $\mu_3$ (from
the dependency graph) it is rejected. On the other hand
$\mu_6$ conflicts with $\mu_8$ but is concurrently executable with
both $\mu_3$ and $\mu_5$. Thus $\mu_8$ is delayed. In effect, $\mu_6$ and
$\mu_8$ are interchanged from their original partitions $L_3$ and
$L_4$. The output from this step is $I_3 = \{\mu_3, \mu_5, \mu_6\}$ while
$\mu_8$ is tentatively placed in $L_4$.

Step [4]: Because $\mu_{11}$ is data dependent on $\mu_8$, $\mu_{11}$ is
transferred down one level to $L_5$. $L_4$ now contains $\mu_7$,
$\mu_8$, $\mu_9$, $\mu_{10}$. $L_4$ is examined and it is seen that $\mu_7$ and $\mu_{10}$
are in conflict, hence one of these must be delayed. It
turns out that the choice can be either. Supposing $\mu_{10}$
to be delayed, the output produced by this step is $I_4 =
\{\mu_7, \mu_8, \mu_9\}$ while $L_5$ contains $\mu_{10}$, $\mu_{11}$, $\mu_{12}$, $\mu_{13}$.


Step [5]: The dependency graph indicates $\mu_{10}$ and $\mu_{11}$ must
precede $\mu_{12}$ and $\mu_{13}$ respectively. Thus an additional
level, $L_5 = \{\mu_{10}, \mu_{11}\}$ is created; $L_6$ now contains
$\mu_{12}$, $\mu_{13}$. $L_5$ must be further partitioned into $\{\mu_{10}\}$ and
$\{\mu_{11}\}$ since these MO's are in conflict. Thus microinstructions
$I_5 = \{\mu_{10}\}$, $I_6 = \{\mu_{11}\}$ are obtained.


Step [6]: The remaining partitions $L_6$ and $L_7$ are examined.
Since there are no further conflicts, the remaining microinstructions
are obtained in a straightforward manner.

The final form of the output is shown in Fig. 2.18.


The TG algorithm is a heuristic algorithm; in
particular once L has been partitioned according to resource
usage, resource conflicts are resolved on the basis of
heuristic rules. Unfortunately the heuristics used are
not clearly stated, which makes an analysis of the algorithm
difficult. For instance, to select MO's that have to be 
delayed due to resource conflicts, a rule is used that 
the first MO's to be delayed are those with only one 
successor. If additional micro-operations need to be 
delayed, then the delay is determined as a "function of 
common successors". Just what exactly this "function" is, 
is left unspecified. 

Since resource conflict resolution does not appear
to take timing into consideration, it is likely that the
algorithm is applicable only to monophase systems.

Because the optimizing strategy is localized -
MO's are moved from one time frame to an adjacent time
frame but no further - the optimizing ability of the TG
algorithm is also limited. However, it is instructive
to compare the performances of the RT, TG and JD algorithms
on the same input (Fig. 2.15). The output produced by
applying the RT algorithm is given by Fig. 2.19; an
additional microinstruction is required. Applying the
JD algorithm on the other hand, produces the same output
as is produced by the TG algorithm (Fig. 2.18); this can
be verified by examining the conflict graph that would
be constructed by the JD algorithm using the information
provided in Figs. 2.15 and 2.17. For the sake of simplicity
(since there are no labelled edges for this particular
example), the conflict graph is represented by a binary
(adjacency) matrix A (Fig. 2.20):

$$a_{ij} = 1 \text{ if there is an edge from } \mu_i \text{ to } \mu_j; \quad a_{ij} = 0 \text{ otherwise.}$$
$I_1 = \{\mu_1\}$                         $I_1 = \{\mu_1\}$
$I_2 = \{\mu_2, \mu_4\}$                  $I_2 = \{\mu_2, \mu_4\}$
$I_3 = \{\mu_3, \mu_5, \mu_6\}$           ...
$I_4 = \{\mu_7, \mu_8, \mu_9\}$           ...
$I_5 = \{\mu_{10}\}$                      ...
$I_6 = \{\mu_{11}\}$                      ...
$I_7 = \{\mu_{12}, \mu_{13}\}$            $I_8 = \{\mu_{12}, \mu_{13}\}$
$I_8 = \{\mu_{14}\}$                      $I_9 = \{\mu_{14}\}$

Fig. 2.18  Output produced by the TG algorithm          Fig. 2.19  Output produced by the RT algorithm

Fig. 2.20  Matrix Form of the Conflict Graph


2.3.4 The Yau-Schowe-Tsuchiya (YST) Algorithm [77]


Prior to describing this algorithm, a few defini- 
tions are necessary. 


A data available MO is an MO for which all MO's on
which it is directly data dependent have been assigned
to microinstructions. A set of such data available MO's
is a data-available set. A complete microinstruction is
a microinstruction to which no additional MO (from a data
available set) can be added without causing resource
conflicts.

As in the previous section, I shall describe the 
YST algorithm through an example. Consider the simple 
dependency graph of Fig. 2.21, in which the ENTRY and
EXIT nodes are assumed to be such that all MO's following
ENTRY are data dependent on it, and EXIT is data available
only when all its preceding MO's have been executed.

Fig. 2.21  Dependency Graph for the YST Algorithm: An Example


Detection of parallelism will then be confined to
the set $M^* = \{\mu_1, \mu_2, \mu_3, \mu_4\}$. The resource (unit) conflicts
between the MO's are described by a set of conflict
statements, which in terms of the notation of (2.4) are
for this example:

$$U_1 \cap U_3 \neq \emptyset$$
$$U_2 \cap U_3 \neq \emptyset \qquad (2.9)$$
$$U_2 \cap U_4 \neq \emptyset$$
From Fig. 2.21 we observe that the lower bound on
the number of microinstructions is 2. The aim of the
procedure is to derive a set of alternate microinstruction
sequences and select the sequence closest to this 
"computed" lower bound. To reduce search time, the 
algorithm uses several items of information to decide 
whether to terminate searching for a particular sequence 
or not. In the following description, these termination 
criteria are indicated informally. For further details 
the reader is referred to [77]. 

Given the dependency graph (Fig. 2.21) and the 
conflict statements (2.9), the YST algorithm proceeds 


as follows: 


Step [1]: Each $\mu_i \in M^*$ is placed in a separate (temporary)
partition: $I_1 = \{\mu_1\}$, $I_2 = \{\mu_2\}$, $I_3 = \{\mu_3\}$, $I_4 = \{\mu_4\}$. Let
$P = \{I_1, I_2, I_3, I_4\}$. $|P|$ denotes the "current" upper bound
on the number of microinstructions.

Step [2]: Generate the data available set D and the data
non-available set $D' = M^* - D$. Here $D = \{\mu_1, \mu_3\}$,
$D' = \{\mu_2, \mu_4\}$.

Step [3]: Select from D, a complete microinstruction.
Here $\{\mu_1\}$ and $\{\mu_3\}$ are possible candidates. Select any
one of these arbitrarily, say $\{\mu_1\} = I_5$ and set $I = I \cup \{I_5\}$
where $I$ (initially empty) is the set of complete microinstructions
already generated. On completing this step,
$I = \{I_5\}$. The remaining elements in D are saved in a
separate partition $I_6$.


Step [4]: D is enlarged with those MO's in D' made data
available as a result of the selection in Step [3]; the
same MO's are also deleted from D'. Thus $D = \{\mu_2, \mu_3\}$,
$D' = \{\mu_4\}$.

Step [5]: Repeat Step [3]. Two trivial complete microinstructions
$\{\mu_2\}$, $\{\mu_3\}$ are possible. $I_7 = \{\mu_2\}$ is
arbitrarily selected and $I = I \cup \{I_7\}$. On completing this
step, $I = \{I_5, I_7\}$, $D = \{\mu_3\}$, $D' = \{\mu_4\}$; $I_8 = \{\mu_3\}$ is saved.

Step [6]: Repeat Step [4]. However since $\mu_4$ is data
dependent on $\mu_3$ and $\mu_3$ is still in D, D and D' remain
unchanged.

Step [7]: Repeat Step [3] for $D = \{\mu_3\}$. This results in
$I = \{I_5, I_7, I_9\}$, where $I_9 = \{\mu_3\}$. On repeating Step [4],
$D = \{\mu_4\}$, $D' = \emptyset$.

At this stage, since $D = \{\mu_4\}$, $D' = \emptyset$ and $|I| = 3$,
repeating Steps [3] and [4] would result in $|I| = 4$.
The algorithm stops pursuing this particular sequence
since it anticipates that the number of resulting microinstructions
would be $|P| = 4$. Instead:


Step [8]: It backtracks and selects $I_6 = \{\mu_3\}$ saved in
the first iteration of Step [3] as an initial complete
microinstruction. Note that this selection re-initializes
D and D' to $D = \{\mu_1, \mu_3\}$, $D' = \{\mu_2, \mu_4\}$. On iterating
Steps [3] and [4], the algorithm obtains $I = \{I_6, I_{10}, I_{11}\}$,
$I_{10} = \{\mu_1, \mu_4\}$, $I_{11} = \{\mu_2\}$. Thus $|I| = 3 < |P|$
and the output is nearer to the computed lower bound. P
is set to I.

Step [9]: Since $|P| > 2$ still, and there remains a microinstruction
choice saved at an earlier stage (viz., $I_8 =
\{\mu_3\}$), the algorithm backtracks and selects $I_8 = \{\mu_3\}$ as
a possible choice instead of $I_7 = \{\mu_2\}$. For this backtrack
to be effective D and D' are re-initialized to
$D = \{\mu_2, \mu_3\}$, $D' = \{\mu_4\}$. On iterating Steps [3] and [4],
the output produced is $I = \{I_5, I_8, I_{12}\}$, $I_8 = \{\mu_3\}$,
$I_{12} = \{\mu_2\}$, and $D = \{\mu_4\}$. Since $|I| = |P|$ and
$D \neq \emptyset$, pursuance of this sequence is stopped. Finally,
since no other microinstruction choices remain, the
algorithm terminates producing as its result, the output
from Step [8].
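
The search just traced can be approximated by the following Python sketch of a depth-first search with an upper-bound cutoff. It is a simplified illustration, not the procedure of [77]: the termination test used here (abandon a sequence once it cannot beat the current bound) is an assumption standing in for the authors' criteria, and the names `complete_microinstructions` and `shortest_sequence` are hypothetical.

```python
from itertools import combinations


def complete_microinstructions(avail, conflicts):
    """All maximal conflict-free subsets of the data-available set."""
    avail = sorted(avail)
    found = []
    for size in range(len(avail), 0, -1):  # larger subsets first
        for combo in combinations(avail, size):
            s = frozenset(combo)
            if any(pair <= s for pair in conflicts):
                continue  # contains a conflicting pair
            if not any(s < t for t in found):
                found.append(s)  # not contained in a larger free set
    return found


def shortest_sequence(ops, deps, conflicts):
    """Search for a shortest sequence of complete microinstructions.

    ops: MO's in SLM order; deps: MO -> set of MO's it directly depends on;
    conflicts: set of frozenset pairs of MO's that share a unit.
    """
    best = [[frozenset([m]) for m in ops]]  # P: one MO per microinstruction

    def search(done, seq):
        if len(done) == len(ops):
            if len(seq) < len(best[0]):
                best[0] = list(seq)
            return
        if len(seq) + 1 >= len(best[0]):
            return  # this sequence cannot improve on the bound |P|
        avail = [m for m in ops
                 if m not in done and deps.get(m, set()) <= done]
        for mi in complete_microinstructions(avail, conflicts):
            search(done | mi, seq + [mi])

    search(frozenset(), [])
    return best[0]
```

Applied to the example of this section (MO 2 dependent on MO 1, MO 4 dependent on MO 3, and unit conflicts between the pairs 1-3, 2-3 and 2-4), the sketch returns three microinstructions, with MO 3 first, MO's 1 and 4 together, and MO 2 last.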
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Search Tree Generated by the YST Algorithm 


A schematic view of the search tree generated by
the YST algorithm is shown by Fig. 2.22. In practical
situations the size of the solution space may be reduced
considerably, by the data dependencies and the potential


parallelism between MO's. In the given example for
instance, possible sequences in the solution are $\langle \mu_1 \mu_2 \mu_3 \mu_4 \rangle$,
$\langle \mu_1 \mu_3 \mu_2 \mu_4 \rangle$, $\langle \mu_3 (\mu_1, \mu_4) \mu_2 \rangle$, $\langle \mu_3 \mu_1 \mu_2 \mu_4 \rangle$ and $\langle \mu_3 \mu_4 \mu_1 \mu_2 \rangle$.
Of these, the last two are never generated by the search
process because once $\mu_3$ is selected as the first complete
microinstruction, $\mu_1$ and $\mu_4$ will always be placed together
as parallel MO's.
In the worst case however, the number of nodes
generated in the search tree will be exponential in n
(the length of the SLM). Since each node will necessitate
at least one pairwise comparison between MO's, the worst
case complexity of the algorithm will be $O(K^n)$ for some K.
Such a situation arises for example with the following
sequence of MO's:

    $\mu_1$ : $A \leftarrow B + C$        $U_1 \cap U_2 \neq \emptyset$
    $\mu_2$ : $D \leftarrow \ldots$         $U_1 \cap U_3 \neq \emptyset$        (2.10)
    $\mu_3$ : $G \leftarrow B - E$        $U_2 \cap U_3 \neq \emptyset$.

Here $\mu_1$, $\mu_2$, $\mu_3$ are mutually data independent but because of
the unit conflicts, the YST algorithm will generate all 3!
sequences of MO's.

A heuristic modification proposed by the authors 


to reduce search time is to attach a weight $w(\mu_i)$ to each
vertex $\mu_i$ in the dependency graph; $w(\mu_i)$ equals the number
of MO's that are data-dependent on $\mu_i$. Furthermore, the
MO's in D are ordered according to the input sequence.
Complete microinstructions are then generated starting
with the first MO in D, the second MO, etc., and a weight
$W(I_j)$ is assigned to each such microinstruction generated,
this being defined by

$$W(I_j) = \sum_{\mu_i \in I_j} w(\mu_i) \qquad (2.11)$$


The selection criterion in Step [3] becomes (instead 
of an arbitrary selection) that microinstruction with the 
largest weight, the rationale being that the selection 
will probably free the largest number of data dependent 
MO's for subsequent selection. 
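
The weighted selection might be sketched in Python as below. Two assumptions are made that the text does not settle: the weight of an MO counts only its direct dependents, and the weight of a microinstruction is taken to be the sum of the weights of its MO's. The names `op_weights` and `pick_heaviest` are hypothetical.

```python
def op_weights(ops, deps):
    """w(mo) = number of MO's directly data dependent on mo (assumed)."""
    return {m: sum(1 for d in deps.values() if m in d) for m in ops}


def pick_heaviest(candidates, w):
    """Choose the candidate microinstruction of largest total weight.

    candidates is a list in generation order; ties go to the earliest
    candidate, because max() returns the first maximal element.
    """
    return max(candidates, key=lambda mi: sum(w[m] for m in mi))
```

For the four-MO example of this section the weights are 1 for MO's 1 and 3 and 0 for MO's 2 and 4, so when the candidates are {2} and {3} the sketch selects {3}, which frees MO 4 for the next selection.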

The YST algorithm being an exhaustive search pro- 
cedure, guarantees an optimal sequence of microinstructions. 
When the heuristic method is used, optimality may not 


result. Finally, the algorithm ignores problems of timing. 


2.4 Summary 


This concludes the review of algorithms for detect- 
ing parallelism in SLM's. To summarize, the JD algorithm 
appears to be the most useful from the viewpoint of 
generality and algorithmic complexity. The YST algorithm - 
within the context of monophase systems - guarantees a 


minimal output (note that the JD method if applied to the
example of Fig. 2.21 and (2.9) will not produce an
optimal output). It is however potentially inefficient.
The TG algorithm is also applicable only to mono- 
phase systems, uses heuristics and attempts but does not 
guarantee, optimal output. The RT algorithm is less 
general than the JD method, and may produce a lengthier 


output than the latter.


CHAPTER III 


POTENTIAL PARALLELISM IN MICROPROGRAMMABLE PROCESSORS 


3.1 Introduction 


The algorithms described in Chapter II serve to
determine or "expose" the parallelism within SLM's.
This parallelism originates fundamentally, in the fact
that within a machine's data flow, several operational
units and data paths may be concurrently active without
any mutual resource conflicts. The primary objective in
utilizing a horizontal microword organization is to take
explicit advantage of this data flow characteristic
[Sosy 

I shall use the term potential parallelism to
denote this general characteristic of parallelism as
embodied in a horizontal microword organization; the
degree of potential parallelism $D_p$ is defined as the
maximum number of MO's that can be executed from a single
microword.

In the present analysis, I shall assume that the
control memory (CM) is a regular array in the sense that
all its words are identically organized (Fig. 2.1). Thus

specific subset of MO's have been specified (by the microprogrammer)
to be executed from $W_i$. $W_i$ can thus also be
viewed as a state variable whose individual values, the
states, denote possible microinstructions that can be
stored in $W_i$. Given a microinstruction $I_j$, the degree of
actual parallelism of $I_j$, $D_a(I_j)$, is simply the cardinality
of the MO set comprising $I_j$. Thus while $D_p$ is invariant
for some given microword organization, $D_a$ may (and in
general, will) vary from one microinstruction to another
(Fig. 3.2). However, $D_p$ denotes an upper bound on $D_a$.

As I pointed out in Chapter I, potential parallelism 
becomes significant as a concept in the context of writ- 
able control memories (WCM's). In designing a microword 
organization for a WCM, the nature of the user micro- 
programs will not be known to the designer. Thus, enhancing 
the microword potential parallelism is clearly one of the 
most important feasible design objectives. 

The purpose of this chapter is to examine the formal 
nature of potential parallelism and a means of maximizing 
it; and to analyse its effect on the control memory word
size.


3.2 Analysis of Potential Parallelism


Let $\mu^* = \{\mu_1, \mu_2, \ldots, \mu_q\}$ denote the set of all MO's
in a microprogrammable processor. Then informally,
$\mu_i, \mu_j \in \mu^*$ are defined to be potentially parallel, denoted
$\mu_i \parallel_p \mu_j$, if their executions involve no conflicts between
their resources. More formally, defining the resource
independent relation $\sigma$ as

$$\mu_i \,\sigma\, \mu_j \text{ iff } (U_i \cap U_j = \emptyset) \text{ for } \mu_i, \mu_j \in \mu^*,$$

$$\mu_i \parallel_p \mu_j \text{ iff } [V_i \cap V_j = \emptyset] \vee [(V_i \cap V_j \neq \emptyset) \wedge (\mu_i \,\sigma\, \mu_j)] \qquad (3.1)$$

Notice that if $V_i \cap V_j = \emptyset$, $\mu_i$, $\mu_j$ will always be executed
in separate clock cycle phases, hence there will be no
resource conflicts. On the other hand, the condition
$(V_i \cap V_j \neq \emptyset) \wedge (\mu_i \,\sigma\, \mu_j)$ means that though $\mu_i$, $\mu_j$ are executed
in the same phase, they are conflict free since they use
disjoint resource sets.
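
The test for potential parallelism can be written as a short Python predicate. This is a sketch under two assumptions: an MO is reduced to a pair (unit set, phase set), and resource independence means disjoint unit sets. The function names are hypothetical.

```python
def resource_independent(units_i, units_j):
    """Resource independence, assumed here to mean disjoint unit sets."""
    return not (set(units_i) & set(units_j))


def potentially_parallel(mo_i, mo_j):
    """Potential parallelism test for MO's given as (units, phases) pairs.

    The MO's are potentially parallel if their time validities are
    disjoint, or if they share a phase but are resource independent.
    """
    units_i, phases_i = mo_i
    units_j, phases_j = mo_j
    if not (set(phases_i) & set(phases_j)):
        return True  # always executed in separate phases
    return resource_independent(units_i, units_j)
```

Two MO's that both use an adder are thus potentially parallel only when they are assigned to different phases, whereas an adder MO and a shifter MO are potentially parallel even within the same phase.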

Recall, from Section 3.1, the quantity $D_p$. The problem of
interest here is to maximize $D_p$. From (3.1), note that the
$\parallel_p$ relation between $\mu_i, \mu_j \in \mu^*$ is determined by their
respective resource sets $U_i$, $U_j$ and time validities $V_i$, $V_j$.
Suppose for some pair $\mu_i$, $\mu_j$, $\neg(\mu_i \,\sigma\, \mu_j)$. Then evidently
$\mu_i \parallel_p \mu_j$ iff $V_i \cap V_j = \emptyset$. More particularly, assuming that
time validities have not been (yet) assigned to MO's, then
for a pair of MO's $\mu_i, \mu_j \in \mu^*$ such that $\neg(\mu_i \,\sigma\, \mu_j)$, if we could
assign the time validities $V_i$, $V_j$ such that $V_i \cap V_j = \emptyset$ we
would force $\mu_i$, $\mu_j$ to become potentially parallel.


I shall call this, the phase allocation problem. 


Note that it implies a basic assumption: that the machine


cycle follows a polyphase timing scheme. However, the 
approach developed here can also be used to assess the 
feasibility of polyphase schemes - an aspect which I shall 
further discuss later. The phase allocation problem is 
stated more precisely as follows: 


Let $\mu^*$ be a set of q MO's with equal execution times
$t_\mu$. Determine and allocate a minimal k-phase clock cycle

$$C = \langle \Pi_1, \Pi_2, \ldots, \Pi_k \rangle \qquad (3.2)$$

where $\Pi_i \cap \Pi_j = \emptyset$ for $i \neq j$, $t(\Pi_i) = t$ for all $i$, $t(\Pi_i)$
denoting the duration of phase $\Pi_i$, and $t \geq t_\mu$, such that
the degree of potential parallelism $D_p = q$.

The objective then, is to make all q MO's in $\mu^*$
pairwise potentially parallel. Observe that $D_p = q$ is
obtained trivially by allocating each $\mu_i \in \mu^*$ to a separate
phase $\Pi_i$. In that case k = q and we obtain a q-phase cycle.
This solution is neither minimal nor practical.

The procedure developed below uses the following
heuristics:

(a) All MO's utilizing the same operational unit are to
be allocated to the same phase.

(b) All MO's which can be allocated to the same phase
without violating the $\sigma$ relation will be so
allocated.

(c) Pairs of MO's not satisfying (a) or (b) will be
allocated to disjoint phases.

Rule (a) is essentially a "realistic" hardware constraint.
For, let

$$\mu_i = \langle OP_i, SC_i, SK_i, U, ? \rangle, \quad \mu_j = \langle OP_j, SC_j, SK_j, U, ? \rangle \qquad (3.3)$$

be a pair of MO's with unspecified time validities (denoted
by '?') using the same unit U; clearly if they were to be
assigned the same time validities then $\neg(\mu_i \parallel_p \mu_j)$. On
the other hand if they were assigned disjoint time validities
say $V_i = \Pi_1$, $V_j = \Pi_2$, then when $\mu_i$ is executed, U
will be activated in phase $\Pi_1$ and when $\mu_j$ is executed, U
would be activated in $\Pi_2$. In theory there is no restriction
on such a timing mechanism. In practice the complexity of
the resulting circuitry would be prohibitive, hence the
imposition of rule (a).

In order to apply this rule, consider the set of
MO's μ*. Partition μ* into q* disjoint subsets
such that for any μ_i, μ_j in the same subset, μ_i and μ_j
utilize the same operational unit. Call such a set a unit
equivalent set.

An example of such a set is:

    μ_1 = < ADD, {M1, M2}, {M3}, {ADDER}, ? >
    μ_2 = < SUB, {M1, M2}, {M4}, {ADDER}, ? >        (3.4)
    μ_3 = < ADD, {M3, M4}, {M4}, {ADDER}, ? >

For each unit equivalent set, the corresponding
unit equivalent MO is defined by the 5-tuple

    < OP_i, SC_i, SK_i, U_i, ? >                     (3.5)

where, with the union taken over j = 1, ..., k_i,

    OP_i = ∪_j OP_ij,   SC_i = ∪_j SC_ij,   SK_i = ∪_j SK_ij,
    U_i = U_ij   (j = 1, ..., k_i).

For example, given the set (3.4), the corresponding unit
equivalent MO is

    < {ADD, SUB}, {M1, M2, M3, M4}, {M3, M4}, {ADDER}, ? >   (3.6)

Thus, from the original set μ*, we can obtain a set of q*
unit equivalent MO's, i.e.,

    μ_u* = {μ_1, μ_2, ..., μ_q*}                     (3.7)

Henceforth I shall simply refer to members of μ_u* as "MO's",
the prefix "unit equivalent" being implicitly understood.
Furthermore, whatever time validity is assigned to some
μ_i ∈ μ_u* will imply the assignment of the same time validity
to all members of the corresponding unit equivalent set
represented by μ_i. This procedure, then, completes the
implementation of rule (a).
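
The grouping behind rule (a) can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original procedure: the `MO` record and the function name `unit_equivalent_mos` are hypothetical, each MO is assumed to name a single operational unit, and the time-validity field is omitted because it is still unspecified at this stage.

```python
from collections import defaultdict
from typing import FrozenSet, Iterable, List, NamedTuple


class MO(NamedTuple):
    """Hypothetical record for an MO without its time-validity field."""
    ops: FrozenSet[str]      # OP field(s)
    sources: FrozenSet[str]  # SC field
    sinks: FrozenSet[str]    # SK field
    unit: str                # operational unit (assumed to be a single unit)


def unit_equivalent_mos(mos: Iterable[MO]) -> List[MO]:
    """Merge all MO's sharing a unit into one unit equivalent MO."""
    groups = defaultdict(list)
    for mo in mos:
        groups[mo.unit].append(mo)
    merged = []
    for unit, members in groups.items():
        merged.append(MO(
            ops=frozenset().union(*(m.ops for m in members)),
            sources=frozenset().union(*(m.sources for m in members)),
            sinks=frozenset().union(*(m.sinks for m in members)),
            unit=unit,
        ))
    return merged


# The three ADDER micro-operations of the example set (3.4).
example = [
    MO(frozenset({"ADD"}), frozenset({"M1", "M2"}), frozenset({"M3"}), "ADDER"),
    MO(frozenset({"SUB"}), frozenset({"M1", "M2"}), frozenset({"M4"}), "ADDER"),
    MO(frozenset({"ADD"}), frozenset({"M3", "M4"}), frozenset({"M4"}), "ADDER"),
]
merged_example = unit_equivalent_mos(example)
```

Applied to the set (3.4), the sketch yields a single MO whose fields are the unions of the corresponding fields of the three members.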

In the rest of this section, I shall continue to
represent an MO by (2.4) except that the V field is left
undefined. The problem, of course, is to determine these
V fields for each MO in μ_u*.

A subset ρ of μ_u* is termed a resource independent
class (RC) if all pairs μ_i, μ_j ∈ ρ are resource independent.
A maximal RC (MRC) is an RC to which no other MO can be added
without violating this relation. Given μ_u*, a set of MRC's
can then be constructed. Denote this set by

    ρ* = {ρ_1, ρ_2, ..., ρ_r}                        (3.8)

Thus each ρ_i is a set of MO's that are pairwise resource
independent. By (3.1) they are therefore pairwise
potentially parallel even if assigned to the same phase.
Following rule (b) then, the members of an MRC can be
allocated the same phase.
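
One way to enumerate maximal classes of pairwise-compatible MO's is to treat them as the maximal cliques of a compatibility graph. The sketch below uses the basic Bron-Kerbosch recursion; this is a technique swapped in for illustration, not necessarily the construction intended in the text. The function name `maximal_classes` and the predicate `compatible` are hypothetical, and the small relation used in the example is invented.

```python
def maximal_classes(items, compatible):
    """Return all maximal sets of pairwise-compatible items.

    `compatible(a, b)` is assumed to be a symmetric predicate, e.g.
    "a and b are resource independent".
    """
    items = list(items)
    neighbours = {
        a: {b for b in items if b != a and compatible(a, b)} for a in items
    }
    found = []

    def expand(current, candidates, excluded):
        # `current` is a class; it is maximal when nothing can extend it.
        if not candidates and not excluded:
            found.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v},
                   candidates & neighbours[v],
                   excluded & neighbours[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(items), set())
    return found


# Invented example: MO's 1..4 with the independent pairs listed below.
independent_pairs = {(1, 2), (1, 3), (2, 3), (3, 4)}


def independent(a, b):
    return (a, b) in independent_pairs or (b, a) in independent_pairs


classes = maximal_classes([1, 2, 3, 4], independent)
```

The same routine applies to any symmetric pairwise relation, so it could equally be used with "not potentially parallel" as the predicate.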


Example 3.1 


Suppose μ_u* contains 8 MO's, denoted μ_1, ..., μ_8.
Let the MRC's, as determined by applying the above
definitions, be:

    ρ* = {ρ_1, ρ_2, ρ_3, ρ_4, ρ_5, ρ_6}              (3.9)

where
P1 oa {uy Eo Ha} Pa oa {uss eae Ug} 
' 
a = {U5, Uy Ue) Ps = {uge Ugh (Bei 01e 
v 
oe Ty ral 0,6 = tug, Uy} 


Note that the MRC's are not necessarily disjoint. For
instance μ_3 belongs to both ρ_1 and ρ_2.

A cover (or covering set) θ is a set of MRC's (or their
subclasses) such that: (i) all the MO's in μ_u* are included
in θ; (ii) no MO appears in more than one MRC; and (iii) if
any of the MRC's (or their subclasses) are deleted from θ,
one or more MO's will be excluded. A minimum cover θ_m is a
cover containing the smallest number of MRC's (or their
subclasses).

Given a set ρ* of MRC's, covers can be systematically
discovered by applying one of several well known methods
used for the simplification of switching functions [49],
minimizing incompletely specified sequential machines [41],
or minimizing control memory word dimensions [20] (see, for
example, the next section). Thus, for Example 3.1 the
following covers are obtained:

The minimum cover θ_m for this example is, of course, θ_1.

It was stated earlier that members of an MRC are
pairwise resource independent and therefore potentially
parallel. If a pair of MO's μ_i, μ_j belong to disjoint
MRC's, say ρ_p, ρ_q (i.e., ρ_p ∩ ρ_q = ∅), then μ_i, μ_j are
not resource independent; they can only be potentially
parallel if assigned to distinct phases.

If each MRC (or its subclass) in θ is assigned to a
distinct phase, then the MO's in θ will all be pairwise
potentially parallel (this proposition is proved below).
The minimum cover θ_m will then determine the smallest
number of machine cycle phases that preserves this
parallelism.
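
As a rough illustration of what a minimum cover involves, the sketch below searches exhaustively for the smallest number of classes whose union contains every MO, and then trims the chosen classes to disjoint subclasses so that no MO appears twice. It is a brute-force sketch on invented data, not one of the cover-table methods cited above; `minimum_cover` is a hypothetical name.

```python
from itertools import combinations


def minimum_cover(universe, classes):
    """Return a smallest list of disjoint subclasses covering `universe`."""
    universe = frozenset(universe)
    for k in range(1, len(classes) + 1):
        for combo in combinations(classes, k):
            if frozenset().union(*combo) >= universe:
                assigned, cover = set(), []
                for cls in combo:
                    # Keep only MO's not already placed in an earlier class.
                    part = frozenset(cls) - assigned
                    cover.append(part)
                    assigned |= part
                return cover
    raise ValueError("the classes do not cover every MO")


# Invented example: five MO's and four maximal classes.
hypothetical_classes = [{1, 2, 3}, {3, 4}, {4, 5}, {5}]
cover = minimum_cover({1, 2, 3, 4, 5}, hypothetical_classes)
phases_needed = len(cover)
```

Each class of the returned cover would then be assigned to its own phase, so `phases_needed` plays the role of k. Exhaustive search is exponential in the number of classes and is only sensible for small examples.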


Theorem 3.1

If q* is the number of MO's in μ_u*, then any cover θ
gives a value of D_p = q* if the MRC's (or their subclasses)
in θ are assigned to distinct machine cycle phases.

Proof

Let a cover θ = {ρ_1, ρ_2, ..., ρ_k}. By definition, any
μ_ip, μ_iq ∈ ρ_i are resource independent. Thus if the MO's
in ρ_i are assigned to the same phase, say Π_i, then they are
pairwise potentially parallel (by 3.1); i.e. μ_ip || μ_iq for
all μ_ip, μ_iq ∈ ρ_i. If |ρ_i| denotes the cardinality of
ρ_i, the value of D_p for ρ_i is |ρ_i|.

If each ρ_i is assigned to a distinct phase Π_i of
some clock cycle C, then V_ip ∩ V_jq = ∅ for μ_ip ∈ ρ_i,
μ_jq ∈ ρ_j, implying that μ_ip || μ_jq for all μ_ip ∈ ρ_i,
μ_jq ∈ ρ_j.

Finally, let each MO be assigned to a distinct field
in the microword. Then all q* MO's may be executed from a
single microword without resource conflicts. Thus D_p = q*. □


In Example 3.1, the minimum cover θ_m = {ρ_1, ρ_3, ρ_4}
requires a 3-phase cycle (i.e., k = 3):

    C = < Π_1, Π_2, Π_3 >

Those MO's in ρ_1 are assigned to Π_1, those in ρ_3 to
Π_2, and those in ρ_4 to Π_3. Then D_p = 8, provided the MO's
are assigned to distinct fields in the WCM word.


The solution to the phase allocation problem results
in a microword that exhibits the maximum possible value of
D_p, subject of course to the fact that the MO's are unit
equivalent MO's. Whether this solution is practical will
depend, among other factors, on the value of k (the number
of phases obtained) and on the significance (or importance)
of parallelism within the overall set of design objectives.

At the design level, the problem of deciding the
length (duration) of the machine cycle, and the number of
its component phases, is quite complex since several factors
may affect the decision. A discussion of the pragmatics
underlying such timing decisions is provided by Langdon
[45]. Here, I shall consider only one of these aspects,
viz., the relationship between the machine cycle and the
cycle time of CM. As Langdon points out, the latter has
a large influence on deciding the length of the machine
cycle.

Let the CM cycle time be t_c, and k the minimum
number of phases of length t obtained as a solution to the
phase allocation problem. The machine cycle C would then
be of length kt. If kt ≤ t_c, the lower bound on the
machine cycle would, in any case, be t_c. Thus a k-phase
cycle can be utilized and a maximum value of D_p preserved.
On the other hand, if kt > t_c, the above allocation scheme
may unduly increase the effective CM cycle time to kt.
If the increase is too high, then of course much of the
advantage of a highly parallel microword and a fast CM
would be lost.
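
The comparison between kt and the CM cycle time amounts to a single maximum. The sketch below makes the arithmetic explicit; the function name and the nanosecond figures are invented for illustration only.

```python
def effective_cycle_time(k: int, phase_length: float, cm_cycle: float) -> float:
    """Machine cycle implied by k phases of equal length and a given CM cycle.

    If k * phase_length fits within the CM cycle, the CM cycle remains the
    lower bound; otherwise the k-phase scheme stretches the cycle to k * t.
    """
    return max(k * phase_length, cm_cycle)


# Invented figures (nanoseconds): a 200 ns control memory, 3 phases.
fits = effective_cycle_time(3, 50, 200)       # 3 * 50 = 150 <= 200
stretched = effective_cycle_time(3, 100, 200)  # 3 * 100 = 300 > 200
```

In the first case the k-phase cycle costs nothing extra; in the second the effective cycle grows by half, which is the loss the paragraph above warns about.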


3.3 Minimization of the Word Length of WCM's 


The foregoing analysis was concerned with maximizing 
potential parallelism. The reader will note (from the 
proof of Theorem 3.1) that each (unit equivalent) MO must
be assigned to a distinct field of the microword in order
that D_p = q*.

The resulting microword organization is one where
each unit equivalent set (of MO's) is assigned to (i.e.,
encoded by) a distinct field.

A problem that is, in a sense, dual to the potential
parallelism maximization problem is that of minimizing
the word length of control memories. This problem has been
studied by several people, notably by Schwartz [62],
Grasselli and Montanari [32], and Das et al. [20]. Implicit
in these investigations were the following two assumptions:

Assumption (a):

A set of control memory words W_1, W_2, ..., W_n is
already available, each word containing one or more MO's
(Fig. 1.2). That is, a read-only memory with a direct
control word organization [38] is given. The problem is
one of determining a minimally encoded organization [58]
such that the microword bit dimension is minimized.


Assumption (b): 


The problem solution ignores the condition where 
two MO's can only be activated in two different clock 
cycle phases and as a consequence, may not be grouped into 
the same field of the microword. In other words, polyphase 
microinstructions are not considered. 

Using the conditions for parallelism given in (2.9),
Dasgupta and Tartar extended the method of Das et al. to
the case of ROM minimization for polyphase schemes [23].
The assumption made in [23] was that time validities were
already assigned to MO's prior to the specification of
read-only microprograms.

In the present section, I shall consider the problem 
of minimizing the word length ("bit dimension") of 
writable control memories. Recall that in designing 
WCM's, no knowledge is available concerning the micro- 
programs that will be stored in the memory. Hence ROM 


minimization techniques cannot be directly applied here.

In the present analysis, it is assumed that MO's
are completely specified. Let μ* = {μ_1, μ_2, ..., μ_n} be
this set of MO's with time-validities assigned. Then a
potential compatibility class (PCC) is a set of MO's such
that for all pairs μ_i, μ_j in the PCC, ¬(μ_i || μ_j). A
maximal potential compatibility class (MPCC) is simply a
PCC to which no other MO can be added without violating
this relation.

Clearly, any pair μ_i, μ_j ∈ μ* such that V_i ∩ V_j = ∅
can never belong to the same MPCC. For, by definition of
the || relation, V_i ∩ V_j = ∅ implies μ_i || μ_j, i.e.,
they can never belong to the same PCC, hence to the same
MPCC. In determining MPCC's, this fact can be used to
reduce slightly the computational time.

For the set μ*, we can obtain a set of MPCC's. Each
MPCC thus identifies a set of MO's that cannot be activated
together in a microinstruction because of resource conflicts.
Furthermore all members of an MPCC have the same (or
overlapping) time validities. If the clock cycle phases
are non-overlapping, then members of an MPCC will all have
the same time validity, hence they can be placed within a
single microword field and be activated by the same clock
signal. The remainder of this analysis thus assumes a
polyphase, non-overlapping timing scheme.


The WCM minimization problem can now be stated
precisely as follows: Let the set of MPCC's corresponding to
μ* be denoted by φ = {φ_1, φ_2, ..., φ_s}, where

    φ_i = {μ_i1, μ_i2, ..., μ_ik_i}   (i = 1, ..., s)   (3.13)

Then, the problem is to find a set φ* = {φ_1*, ..., φ_r*}
of PCC's such that (i) every MO in μ* is in at least one
PCC of φ*, and (ii) the quantity

    B = Σ_{h=1}^{r} ⌈ log_2(|φ_h*| + 1) ⌉ (1)            (3.14)

is minimal, where |φ_h*| denotes the cardinality of φ_h*
and ⌈x⌉ denotes the least integer ≥ x. The quantity B
designates the cost of the minimal cover.
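
The cost B is easy to evaluate mechanically. The sketch below is illustrative only: `encoding_cost` is a hypothetical name and the field contents are invented. It relies on the identity that, for a non-negative integer n, the least integer not below log_2(n + 1) equals the number of bits in the binary representation of n, which avoids floating-point logarithms.

```python
def encoding_cost(fields):
    """Total bits needed when each class is encoded as one microword field.

    One extra code point per field is reserved for the NO-OP, hence the
    width of a field holding n MO's is ceil(log2(n + 1)) = n.bit_length().
    """
    return sum(len(field).bit_length() for field in fields)


# Invented solution with six fields of sizes 1, 1, 1, 2, 3 and 3.
hypothetical_solution = [
    {"a"}, {"b"}, {"c"}, {"d", "e"}, {"f", "g", "h"}, {"i", "j", "k"},
]
bits = encoding_cost(hypothetical_solution)  # 1 + 1 + 1 + 2 + 2 + 2
```

A field of three MO's plus the NO-OP needs exactly two bits, while a fourth MO would force a third bit; this step behaviour is why splitting classes can sometimes lower B.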

Since a PCC is a collection of MO's whose executions
are mutually exclusive, these MO's cannot belong to the
same microinstruction. In this sense a PCC (MPCC) is
equivalent to the CC (MCC) of Das et al. [20]. Hence the
minimization technique developed in [20] can be followed
for the WCM problem, once the MPCC's are obtained. For
the sake of completeness, this procedure is outlined below.

Given a set of MO's μ*, and a set of MPCC's φ, a
table, called the WCM cover table, is constructed by
specifying the MO's μ_1, ..., μ_n in a row, and by entering
φ_i below μ_j if μ_j ∈ φ_i. Each column of the table is
therefore a collection of those MPCC's that contain the
specified MO. Note the analogy of the WCM cover table with
the cover table used in simplifying switching functions
[49].

(1) 1 is added to |φ_h*| to include the "NO-OP" MO in each
field.


Example 3.2 


Suppose the set of MPCC's for some specific collection
of MO's is as shown in Fig. 3.3. Then the corresponding
WCM cover table is given by Fig. 3.4.


o, = (uyr Ug Hy} d6 = tus Hor Hyg? yy} 

do = UWgr Ug yy} o5 =1uUs, Ugt 

d3 = tug, Higr Hy} dg ={usr Hor Uyq) 

d4 = {ugr Hor Hyg} dg = tug, Ur Uygr Hyy) 

5 = tiger Ug} 19 = Mgr Mgr Myy) 
Fig. 3.3

Maximal Potential Compatible Classes


The MPCC's appearing alone in some columns of the
cover table are called globally essential, and the
corresponding MO's heading these columns are called globally
distinguished MO's; these are identified by asterisks in
the cover table (see Fig. 3.4). The corresponding columns
are also called globally essential.

Fig. 3.4

WCM Cover Table

A solution φ* of a WCM cover table is a set of
MPCC's (or their subclasses) such that (i) φ* contains
all the MO's in μ*, and (ii) if any of the MPCC's (or
their subclasses) in φ* is omitted, at least one MO is
not included. A solution is minimal if the cost B is
minimum. (2)

Intuitively, a solution φ* signifies that each MPCC
(or a subclass of an MPCC) in φ* is representable by a
single encoded field in the microword. Clearly the best
(minimal) solution will be that requiring the least number
of bits to encode all the fields.

(2) Note the analogy between a "solution" and a "cover" as
defined in Section 3.2. In fact covers can be derived
in precisely the same manner as solutions are derived
here.

If column i of a WCM cover table forms a proper
subset of some other column j, then column i dominates
column j [20].

Consider a WCM cover table containing a globally
essential column, say i, and let the corresponding globally
essential MPCC be φ_h. Then φ_h must appear in a solution
φ* (since φ_h is the only MPCC containing μ_i). If column
i dominates column j, the latter may be deleted from the
table since μ_j is contained in the MPCC in column i.
Similarly, a column being dominated by a non-essential column
may also be deleted. Finally, if two or more columns are
exactly identical, all but one of these may be deleted.

Given a WCM cover table, if its dominated columns
are deleted, and the essential columns removed, the
resulting table is a reduced WCM cover table.
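
The reduction rules can be sketched as follows. The cover table is represented, as an assumption of this sketch, by a dictionary mapping each MO to the set of MPCC names entered in its column; the names `reduce_cover_table`, `m1`..`m5` and `A`..`D` are invented for illustration.

```python
def reduce_cover_table(table):
    """Return (globally essential MPCC's, reduced cover table).

    A column holding a single MPCC makes that MPCC essential. Every
    column containing an essential MPCC is then covered and removed.
    Among the remaining columns, one that properly contains another is
    dominated and dropped, and exact duplicates are kept only once.
    """
    essential = {next(iter(col)) for col in table.values() if len(col) == 1}
    remaining = {mo: col for mo, col in table.items() if not (col & essential)}
    reduced = {}
    for mo, col in remaining.items():
        dominated = any(other < col for other in remaining.values())
        duplicate = any(col == kept for kept in reduced.values())
        if not dominated and not duplicate:
            reduced[mo] = col
    return essential, reduced


# Invented cover table: column -> MPCC's containing that MO.
hypothetical_table = {
    "m1": {"A"},            # A appears alone, so A is globally essential
    "m2": {"A", "B"},       # covered once A is selected
    "m3": {"B", "C"},
    "m4": {"B", "C", "D"},  # dominated by the column of m3
    "m5": {"B", "C"},       # identical to the column of m3
}
essential, reduced = reduce_cover_table(hypothetical_table)
```

Only the column of `m3` survives here: any MPCC chosen to cover `m3` also covers `m4` and `m5`, which is why the dominated and duplicate columns carry no further information.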

For example, in Fig. 3.4, φ_1, φ_2, φ_3 and φ_5 are
globally essential; the corresponding columns are hence also
globally essential. Removing (selecting) these columns
and deleting all the columns dominated by these columns
yields the reduced WCM cover table of Fig. 3.5.

Fig. 3.5

Reduced WCM Cover Table

The solutions from a WCM cover table can be
systematically found using the procedure for finding the
prime implicant covers of switching functions. Thus,
from the reduced table of Fig. 3.5, the solutions


obtained are toy, biol {oar dor dais Loc, diols and 
oc, dg}. Combining the globally essential MPCC's with 


these, the complete solutions obtained are 
fe We ee Cae ee One {bir bor bar O47 br %gr og) 
{O41 P51 O31 cr 5, diols and {oy, O51 31 Por Po, dgi- 


A minimal solution is obtained from the set of 
solutions by means of the following procedure. 

Given a solution, say φ_1*, a cover table (called the
solution WCM cover table) is constructed with only those
MPCC's in φ_1*. In this table, in addition to the globally
essential MPCC's, some locally essential MPCC's may also
be present. These are identified by asterisks above the
corresponding (locally) distinguished MO's.

Referring to the solution WCM cover table for the 


solution o> ={oz, Por 3, Par oo4 103 (Fig: Bie 8), ae can 


a 


be seen that μ_7, μ_10, μ_11 can be covered by more than
one MPCC.

Fig. 3.6

Solution WCM Cover Table

To find all possible ways of covering these MO's, a
reduced solution WCM cover table containing columns 7, 10,
11 is constructed (Fig. 3.7),
and all the solutions from this are obtained as: {oye 31, 


(PO Nd Gy hero phon ten One Gu Oh KOE HIG 


Fig. 3.7

Reduced Solution WCM Cover Table
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Te say, loge 102 is used to cover Wor yor Hye 
Clearly the appearance of these MO's can be deleted from 


all other MPCC's. This results in the solution 
{u,t, tus}, {u3zt, {usr Ugh, tug, vo, Hyote tugr Ugr Hyyz) 


whose cost, computed according to (3.14), is 9. The
procedure is repeated for the other solutions obtained from
the reduced solution cover table corresponding to φ_1*.
Similarly, starting with φ_2*, φ_3* and φ_4*, solutions can
be obtained; the one giving the smallest value of B is
the minimal solution.


3.4 Conclusions 


The purpose of this chapter was to examine the
nature of parallelism between MO's and its relationship
to two basic design problems: the construction of
polyphase timing schemes, and of minimally encoded
microword organizations. These problems are pertinent in both
microprogrammed (with ROM's) and microprogrammable (with
WCM's) systems. I have considered here the latter case,
hence the stress on "potential" rather than "actual"
parallelism in this chapter.

Suppose a design process begins with maximizing
potential parallelism using the procedure of Section 3.2,
and the number of phases obtained is k. If k is
"acceptable" from the economic viewpoint, then the maximum
potential parallelism (say q*) is preserved by encoding
each set of unit equivalent MO's by a single field: in
effect q* fields are obtained. Clearly, subsequent
application of the word minimization procedure of Section
3.3 will be unnecessary since it will not reduce the
microword length any further. On the other hand, if
phase allocation is such that less than the maximum
potential parallelism is obtained (this will happen if k' < k
phases are used), then the microword minimization procedure
may be effective. There is therefore, in this sense, a
trade-off between potential parallelism and microword length.

CHAPTER IV

PARALLELISM IN STRAIGHT LINE MICROPROGRAMS

4.1 Introduction


In Chapter II, I reviewed several algorithms
that detect parallel micro-operations in SLM's. The
principal conclusions were that the JD algorithm is
distinguishable as being the most general in its
applicability to different host machine structures; it is also
quite efficient. The VST algorithm, on the other hand,
produces an optimal (i.e. minimal) output for monophase
microprograms, ignores timing considerations, and is
asymptotically inefficient. The other two algorithms are
inferior to these either in terms of generality or
optimality.

In the present chapter, the problem of optimizing
parallelism in SLM's is considered at a greater level of
generality than has been done hitherto. In particular,
the idea of permuting the input sequence (SLM), a technique
used by both Tsuchiya and Gonzalez [72] and Yau et al. [77],
though in a limited way, is explored and analysed more
systematically within a polyphase framework.

The concrete result of this analysis is a new,
efficient, optimizing algorithm which is applicable to
both monophase and polyphase systems. The algorithm
produces an output which, though not optimal, is the
"smallest" in a more restricted sense.


4.2 Basis for the Optimizing Algorithm 


A useful concept that needs be introduced at this 
point, is that of a (microprogrammable) machine state. 
This is defined as the outcome of an assignment of values 
to each distinct memory resource in the machine. Each 
memory resource can itself be regarded as a state variable 
taking values from a well-defined range. Thus, it also 
makes sense to talk of the state of a subset of memory 
elements. The overall machine state is then given by the 
ordered set of values assumed by the memory resources at 
that time. 

As a trivial example, if {M_1, M_2, M_3, M_4} is the set
of memory elements in a machine, then the set of values
(M_1 = 3, M_2 = 6, M_3 = 1, M_4 = 0) defines a state that is
distinct from the state defined by the values (M_1 = 6,
..., M_4 = 0).

A state change is said to occur when there is a
change in the values of any subset from the set of memory
resources. One of the means of inducing or effecting a
state change is through an event (see Section 2.1) or,
what is equivalent, a micro-operation. Note that this is not
the only agent of a state change. For example, in some
microprogrammable machines, the contents of the control
memory address register are altered by a hardwired
microsequencing unit. The state change effected in this case
is not done through a micro-operation.

A state-based definition of parallelism in SLM's
can now be given as follows:

Definition 4.1

Let S be an SLM and let μ_i < μ_j in S. Then μ_i and
μ_j are said to be locally parallel, denoted μ_i || μ_j, if
for all initial machine states, the execution of a
microinstruction I = {μ_i, μ_j} produces the same final machine
state as the sequential execution of I_1 = {μ_i}, I_2 = {μ_j}.

Note that this definition merely makes more precise
the concept of parallelism as being able to "place a pair
of MO's in the same microinstruction". The term "locally
parallel" is used here to distinguish the parallelism
within SLM's from "global" parallelism, which I shall
discuss in Chapter V. The conditions for μ_i || μ_j are,
of course, given by the expression (2.7), i.e.,


tts 


; Ree Ms ee re ies Hs) ; (A'S 15) 
4.2.1 Significance of Branch Micro-operations


A micro-operation (MO) was defined in Section 2.1
as simply a control signal originating in the control
memory which causes some event to take place. Here, I
shall further distinguish between functional MO's (FMO)
and branch MO's (BMO).

BMO's represent conditional (two-way) or
unconditional branches. In the context of the 5-tuple
representation (2.4), it is assumed that SC denotes the set of
arguments for the predicate defined by the (branch) OP,
and SK designates the explicit destination of the BMO,
i.e., the micro-operation to be executed next if the
predicate is satisfied. For example, the notation

    < BHIGH, {R1, R2}, {μ_k}, U, V >                 (4.2)

may mean that if R1 > R2 then control transfers to μ_k,
else the next sequential micro-operation is accessed.
Notice that the explicit destination in (4.2) is an MO
only because the microprogram is specified in canonical
form. In generating microinstructions, the explicit
destination has to be transformed into a control memory
word address, viz., the address of whichever
microinstruction contains μ_k. The state change effected by
the execution of a BMO, therefore, is the assignment of a
new value to the control memory address register only;
no other memory resources are affected. An FMO is simply
any MO other than a BMO.

Recall that in an SLM S = <μ_1, μ_2, ..., μ_n>, the only
entry and exit points are μ_1 and μ_n respectively. This
implies that μ_i can be a BMO provided that the explicit
destination of the branch is none of the MO's μ_2, ..., μ_n.
Furthermore, if μ_i is a BMO then we have the following
obvious property:

Lemma 4.1

Let μ_i < μ_j for some pair of MO's μ_i, μ_j in an
SLM such that μ_j is a BMO. If I(μ_i), I(μ_j) denote
microinstructions containing μ_i and μ_j respectively, then
I(μ_j) cannot precede I(μ_i).

Proof

Since there is a single entry point (μ_1) and a
single exit point (μ_n) in an SLM, evidently if any one
MO is executed then so is every other MO in S. Let the
execution of I(μ_j) precede that of I(μ_i); on executing
I(μ_j), if the branch condition is satisfied, the next
microinstruction to be executed is I(μ_k), where μ_k is the
explicit destination of the BMO. In that case I(μ_i), and
hence μ_i, may be bypassed, contradicting the earlier
assertion. □

The significance of this rather trivial lemma
lies in that it serves to indicate that, while designing
a general optimizing algorithm, branch MO's must be
treated as a special case.


4.2.2 Invertibility of Micro-operations

To motivate the approach developed below, consider
the short example sequence of Fig. 4.1. We see that
¬(μ_1 || μ_2) and ¬(μ_2 || μ_3), although the absence of
parallelism in the two cases is for quite different
reasons. If the Jackson-Dasgupta algorithm were to be
applied to this example, we would obtain three
microinstructions I(μ_1), I(μ_2), I(μ_3), and these would be
executed in precisely this order.

Notice, however, that μ_2 and μ_3 can be interchanged
or inverted in the sequence without affecting the final
result (machine state). If μ_2 and μ_3 are inverted, we
obtain the (state) equivalent SLM S_1* = <μ_1, μ_3, μ_2>
(Fig. 4.2), and since μ_1 || μ_3, only two microinstructions
are required, viz., I(μ_1, μ_3) followed by I(μ_2). Since
there are no other possible permutations we have, in fact,
obtained the minimal set of microinstructions.


u, = < GATE, Bids -yent Bans 3 Bee: 
M4 = < ADD, 12.344, 4439") (ADDERS Feit > 
u, = < ADD, (305) ,. 257) 1ADDERS ab > 
Fig. 4.1

An Example SLM: S_1
H, = < GATE, bs Ga aera eB Peed Perc 
u, = < ADD, 13,51.,,.15), (ADDER! , .11 > 
uy = < ADD, i233 {4}, {ADDER}, Ill > 


Fig. 4.2

Fig. 4.3  An Example SLM: S_2

Fig. 4.4  S_2*: An Inverted Version of S_2


The reason that μ_2 and μ_3 can be inverted is, of
course, the fact that they employ disjoint sources
and sinks. Ignoring for the present the operational unit
and time-validity components, consider the sequence S_2
(Fig. 4.3). The point is, can we legitimately invert these
MO's?

Assuming that the memory elements are all 4-bit 
registers, and that states are represented as binary 
strings, suppose the initial states of A,B,D are respec— 
Lavell ye O00 me Od uma ms LOO tm One exeCur ing Sor the 
relevant final states are C = "0000" and B = "1100".

If this sequence is now inverted, S_2* is obtained
(Fig. 4.4). Given the same initial state, notice that the
final states are still C = "0000" and B = "1100", in
spite of the use of the common resource B!

This is quite obviously due to the fact that A's
initial state happened to be "0000". Only if it could be
guaranteed that, for all initial states, S_2 and S_2* lead
to identical final states would an a priori inversion of
μ_1 and μ_2 be possible.

In this analysis, it will be assumed that the
microprogrammable processors are such that for any pair
of MO's sharing data resources no such guarantee is
possible. Or, to state this more precisely, it is assumed
that for any pair of MO's μ_i, μ_j that share data resources
there exists at least one machine state ω such that the
execution of the sequences <μ_i, μ_j> and <μ_j, μ_i>, with ω
as the initial state, lead to distinct final machine states.
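
The "for all initial states" requirement can be checked exhaustively on a toy model. The sketch below is purely illustrative: the two micro-operations (`C <- A AND B` and `B <- D`) are assumptions chosen to share the register B, not a transcription of the figures, and the names `mo_and`, `mo_move` and `order_independent` are hypothetical.

```python
from itertools import product

REGISTERS = ("A", "B", "C", "D")  # four 4-bit registers (assumed model)


def mo_and(state):
    """Hypothetical MO: C <- A AND B."""
    new = dict(state)
    new["C"] = new["A"] & new["B"]
    return new


def mo_move(state):
    """Hypothetical MO: B <- D."""
    new = dict(state)
    new["B"] = new["D"]
    return new


def run(sequence, state):
    """Execute the MO's of `sequence` one after another."""
    for mo in sequence:
        state = mo(state)
    return state


def order_independent(first, second, bits=4):
    """Return (True, None) if both orders agree on every initial state,
    otherwise (False, witness) where `witness` is a distinguishing state."""
    for values in product(range(2 ** bits), repeat=len(REGISTERS)):
        state = dict(zip(REGISTERS, values))
        if run([first, second], state) != run([second, first], state):
            return False, state
    return True, None


# With A = "0000" the two orders happen to agree ...
lucky = {"A": 0b0000, "B": 0b0101, "C": 0b1111, "D": 0b1100}
same_when_a_is_zero = run([mo_and, mo_move], lucky) == run([mo_move, mo_and], lucky)

# ... but over all initial states they do not.
invertible, witness = order_independent(mo_and, mo_move)
```

The search visits 16^4 = 65,536 states, which is trivial here but grows exponentially with the number and width of registers; this is one practical reason for assuming, as above, that no such guarantee exists for MO's sharing data resources.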


The notion of invertibility is then made precise by the
following:

Definition 4.2

Let S be an SLM and μ_i, μ_j be in S. Then μ_i, μ_j are
said to be invertible, denoted μ_i λ μ_j, if μ_i δ μ_j.

One should note the distinction between the λ and
δ relations. The relation μ_i δ μ_j depends only on the
details of the two MO's, whereas μ_i λ μ_j depends, in
addition, on the appearance of μ_i and μ_j within an SLM.
Like the δ relation, however, λ is symmetric.

Using Def. 4.2 in conjunction with the expression
(4.1) leads to:




Theorem 4.1

Let S be an SLM, μ_i < μ_j and ¬(μ_i || μ_j). Then
μ_i λ μ_j if and only if

    (V_i ∩ V_j ≠ ∅) ∧ (μ_i δ μ_j) ∧ (U_i ∩ U_j ≠ ∅)

Proof

Assume that μ_i < μ_j, ¬(μ_i || μ_j), and μ_i λ μ_j.
Then μ_i δ μ_j. Furthermore, under the above assumptions,
V_i ∩ V_j ≠ ∅ must also be true. For otherwise, viz., if
V_i ∩ V_j = ∅, then either V_i < V_j or V_i > V_j holds. But
V_i < V_j implies, by Definitions 2.1(ii), 2.2, and the
expression (4.1), that μ_i || μ_j, a contradiction.
Similarly, (V_i > V_j) ∧ (μ_i δ μ_j) implies, by Def. 2.1(ii)
and the expression (4.1), that μ_i || μ_j, again a
contradiction; i.e., V_i ∩ V_j ≠ ∅. Assume now that
U_i ∩ U_j = ∅. Then (V_i ∩ V_j ≠ ∅) ∧ (μ_i δ μ_j) ∧
(U_i ∩ U_j = ∅) implies μ_i || μ_j, contradicting the
assumption; hence U_i ∩ U_j ≠ ∅.

The converse is trivially true since μ_i < μ_j and
μ_i δ μ_j means, by definition, that μ_i λ μ_j. □


Suppose that in a given SLM, μ_i < μ_j and μ_i λ μ_j;
then the ordering μ_i < μ_j is said to be the specified
ordering. As a result of the λ relation, we may change
the ordering from the specified one. The particular
condition ¬(μ_i || μ_j) ∧ (μ_i λ μ_j) will be denoted by the
relation μ_i λ* μ_j. That is, μ_i λ* μ_j represents the fact
that a pair of non-parallel MO's may be inverted. For
example, referring to Fig. 4.1, μ_2 λ* μ_3 is true.

Recall that if μ_i || μ_j, then μ_i, μ_j can be placed
in a microinstruction I. If I = {μ_1, μ_2, ..., μ_n},
then for all pairs μ_i, μ_j ∈ I, μ_i || μ_j; that is, I forms
a parallel set. An ordering < on a pair of
microinstructions I_i, I_j is defined such that if I_i < I_j
then I_i is executed before I_j. Thus, given an ordered
sequence I_1 < I_2 < ... < I_n, it makes sense to speak of an
"earlier" or "later" microinstruction. For convenience, I
shall order microinstructions on their indices, i.e., i < j
implies I_i < I_j. Furthermore, as in Lemma 4.1, the
notation I(μ_i) will denote a microinstruction containing μ_i.
Finally, referring to Def. 2.1, the expressions (i), (ii)
and (iii) of this definition will be distinguished by the
relations μ_i δ_1 μ_j, μ_i δ_2 μ_j and μ_i δ_3 μ_j respectively.


Theorem 4.2

Let I_q be a microinstruction containing MO's from
an SLM S, and μ_j an MO in S such that (i) μ_i < μ_j
for all μ_i ∈ I_q, and (ii) μ_j is not already in I_q. Then


* 
(a) if, for all qe Lae (ui; 6 Hs A SK, SK. = hve (ned M5) 
then some rus) can precede Wes 
(b) L£ there exists some ee such. that yi. te aes 
ee elle al & ecede I(u.). 
Ce r us) then q MUSE. pire (ls 
(e) If there exists some uU,€ Lo such that Us Y Use and 


fom ale We Ty~ {uy} Uy ae Ms then the earliest 


microinstruction for He is Io: 




(d) If there exists some Hi € ue such that (Hs 65 Hee 


ok. = 
3 os Aid) peanadnrorvalt Wy € se tus, Wwe Use 


then the earliest microinstruction for Ws is et 


(e) Nee ue is a BMO then Hs can be placed in TG if and 


only= if fontal lei. se less bes M and there exists 


gq 
no other microinstruction T, such that on 


Proof 


(a) Hor some Ue 6 fet Mee Wes then M.wcanmalwayvs 
et gq et us Us ¥ 


precede Uae On the other hand if (ui; 6 Wal A (SK; 8h, 59) 


then Mee are data independent since by Def. 2.1, U9 LL. 


J 
implies. ee as and We Madd MoUs ME = >?) means that 
Us a Thus Us and He can be placed in the same micro- 
iWSCEUCCLOn gO. .One ‘Can precede thesothner.. alia tom val. 


MO's in I_ one of the above conditions holds, then we) 
can precede te for some Ey o 

i i ea ; 
(b) VCH oe us) implies that either T(u;) Ts) or 
vice versa. But mus r* tel implies that the specified 


ordering must be preserved from which the statement 


follows. 


(c) Let we be partitioned into {Tony} Suchy that, cor 


‘ : : — Uidel ; b EE d 
all tne Se Uy Ee Ue and Wi Y Ys en us can be place 


i a Mau hence by Def. 4.2 
in Io: But Wy Ys implies (Hy aus), y A 


the specified ordering Wi < Ue cannot be changed. Thus 


I(μ_j) cannot precede I_q = I(μ_i).
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(d) As above, partition Ty into {Toru} such that £0r 


all u, ell, Wy PR Ha, and (yu, $4 ua) A (SK; 9 SK, AO) 


j J 
Then obviously We can be placed in he Since ra remains 
a parallel set of MO's. But SK, 9 SK, # > implies (u, B us) 
hence Ry iE Somthac le cannot precede Hye Thus Tus) 
cannot precede T(u;). 
(e) This statement follows trivially from the defini- 
tion of a microinstruction and Lemma 4.1. OU 
The above theorem, and its use in constructing the
parallelism-detection algorithm (to be described below),
is best understood through an example.
Consider an SLM § =<Uyrlgrese rly? from which a 


microinstruction Iy = {Uprese rth has already been cons- 


tructed, where 


Die CATE EER 1Aly 90 een 
M5 = * GALE, ROT Bey go Nee 
U3 = < ADD, (AAG ee a ADDER I, > (4.3) 
Wi, = < GATE, Crepe oak, Mey liy 
us = < GATE, LCjaF, HRAG7 Yen hy lege 
and I_1 < I_2 < I_3. The remaining MO's in S are given by




a Chee hee ee aa | 1, > 
Wy =< GATE, VR SD eee , 1, > 
Hee => SHL, { A } {D} , {SHIFTER}, Il, > (4.4) 
POM TRE CE rem ts Peeebn te) 2 sonenel ae 
Wij = UBHIGH, {R37R5), {"y,"}, {MSEQR], I, > 
We may then make the following observations: 
eo Boreal! Ws € ae us 6 He and SK, SKe = 9; hence by 


statement (a) of Theorem 4.2, Ue can precede et 


1 * . 
L2i U4 cannot precede Us since ay r Uo) i also, U4 


cannot be placed in Ty since v(uy ies Wo); hence, by 


statement (b) he SMG aie 
ey Since uw, Yug, and for all other u,¢ Thy he Ugr 
Hg can be placed in Wee But since V(U, oUg)s Ug cannot 
precede ee Thus, by statement (c), the earliest micro- 
FWeehere Kohenloyey eye Ug is se 
[4] Since u, 64 Wg, and for all other u, € Tyr Uy Ue Ugs 
Ug can be placed in a: But again, Ug cannot precede Ty 


Since V(U5 Bug) - Hence by statement (d), the earliest 


MLCYOLNSErUCLLON, LOG Ug is toe 
[5] Finally. noLresthat since NA Bie Hig): and Hig is 


a BMO, 2g must precede Tuy) - from statement (e). 


4.3 The Optimizing Algorithm
The algorithm can now be presented. This algorithm 


uses three pointer variables as follows: "i" is a pointer 




to the microinstruction "currently" being examined;
"i*" points to the latest microinstruction in the
ordered sequence of microinstructions generated at any
given time; and "j" points to an element of the input
SLM. The expression "branch(μ)" denotes a predicate
which is TRUE if μ is a branch micro-operation, and is
FALSE otherwise.


Algorithm 4.1: Detection of Parallel Micro-operations in
an SLM.
Input: An SLM S = <μ_1, μ_2, ..., μ_t>.

Output: An ordered sequence of microinstructions
I = <I_1, I_2, ..., I_i*>, I_1 < I_2 < ... < I_i*.

[1] i ← 1; i* ← 1; j ← 1;
[2] I_1 ← {μ_1};
[3] j ← j + 1;
    if j > t then I ← {I_1, I_2, ..., I_i*}; STOP.
[4] if branch(μ_j) then
    begin
        if μ ∥ μ_j for all μ ∈ I_i
[4a]    then I_i ← I_i ∪ {μ_j}
        else
        begin
[4b]        I_(i+1) ← {μ_j}; i ← i + 1; i* ← i
        end;
        goto [3]
    end


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[5] aoe oer ene ko Gia) 
then 
begin 
shee Wi Geedaey als ake 
Ui Me 8 
goto [3] 


end 


i] i] 
[6] Tf (dueTysuyus) Au Ae uy * ute Tytuh) 
then 
begin 


Be ees I tus ti 


Gotouls| 
end 
' j x 
[7] If @ueT, 3 usu, ASkn SK, A$) A Cu ||, use wet -ti}) 
then 
begin 


eee tus}; 


goto 131 
end 
[8] While [(u OTe BSC SK, = OE VGH was i € T,] A [i> 0] 
do begin 


Sap ok n Ske =o) «ed. 

if (uv ou pel Sa Leng OE os 
thenwang + 

he ah coal 


end 




[9] If i = 0 then 


[9a] oe ee eng ea iy 
else 
[9b] begin 
\egoed nis 
while k>0 do 
ease aa 
he kee] 
end 
Thy + {uy hi 
de a ea lL 
end 
oc} ie 
goto! 31 
end 
[10] While aueT,a~(ull, uj) do 
phe Ghee he 
if i>i' then 
begin 
[10a] I, <«tust; i* ea; 
goto #3) 
end 


end 


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eon SOA Leg l tyr 


JOcor io). 0 


Verification of the algorithm proceeds by induction
on ℓ(S), the length of the input SLM S. I shall
first show that for any partition I_i in the output set,
all μ_j, μ_k ∈ I_i satisfy μ_j ∥ μ_k. That is, each I_i obtained
is indeed a microinstruction. I shall also show that the
output satisfies the necessary precedence constraints
imposed by Theorem 4.2.
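The two properties to be verified can also be checked mechanically on any candidate output. The following is a minimal sketch only, not part of Algorithm 4.1: the predicates `parallel` and `must_precede` are hypothetical stand-ins for the parallelism relation and for whatever strict precedence condition Theorem 4.2 imposes on a pair of MO's, and the toy MO names are invented for illustration.

```python
from itertools import combinations


def check_output(slm, microinstructions, parallel, must_precede):
    """Return True if `microinstructions` is a valid arrangement of `slm`.

    slm                -- list of MO names in their specified order
    microinstructions  -- list of sets of MO names, in execution order
    parallel(a, b)     -- hypothetical predicate: may a and b share a microinstruction?
    must_precede(a, b) -- hypothetical predicate: must a execute strictly before b?
    """
    placed = [mo for group in microinstructions for mo in group]
    if sorted(placed) != sorted(slm):
        return False  # an MO is missing, duplicated, or unknown
    index_of = {mo: k for k, group in enumerate(microinstructions) for mo in group}
    # Property 1: MO's sharing a microinstruction are pairwise parallel.
    for group in microinstructions:
        for a, b in combinations(list(group), 2):
            if not parallel(a, b):
                return False
    # Property 2: every required precedence is respected by the ordering.
    for x, a in enumerate(slm):
        for b in slm[x + 1:]:
            if must_precede(a, b) and index_of[a] >= index_of[b]:
                return False
    return True


# Toy relations (assumed for illustration only).
PARALLEL = {frozenset(("m1", "m2"))}
ORDERED = {("m1", "m3"), ("m2", "m3")}


def parallel(a, b):
    return frozenset((a, b)) in PARALLEL


def must_precede(a, b):
    return (a, b) in ORDERED


print(check_output(["m1", "m2", "m3"], [{"m1", "m2"}, {"m3"}], parallel, must_precede))
```

Such a checker says nothing about minimality; it only confirms that a given sequence of partitions is an admissible one.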


Theorem 4.3 


Let I = {I_1, I_2, ..., I_r} be the output produced by
Algorithm 4.1. Then
(a) Moje bil Ha ry € I, iy ele Us Me Uy 
. * 
(b) Tf us < Hu, in S, and wus Me Wy) AwGs d* y,) then 


T(us) <I(u,) bat aU 


Proof 

First note that at the start of each iteration (i.e.,
whenever Step [3] is entered), i = i* denotes the index
of the "latest" partition generated. This itself can be
proved by induction on the number of times Step [3] is
entered. For, it is certainly true the first time the
step is entered since by Steps [1], [2], i = i* = 1 and
only one microinstruction I_1 exists.

Assume that this is true just before the m-th iteration
of Step [3], and let i = i* = n at this stage. Then




(i) the only steps in which i is incremented are Steps
[4b], [5], or [10a], and in each of these steps a
"latest" microinstruction I_(n+1) is created and i*
made equal to i; (ii) in Steps [4a], [6] or [7],
I_i = I_n remains the latest microinstruction and i* remains
unchanged at n; and (iii) in Step [9], either i and i*
remain unchanged at the value n, and no new microinstruction
is created (Steps [9a, 9c]), or a new "latest" microinstruction
I_(n+1) is constructed and i, i* both made equal
to n+1 (Steps [9b, 9c]). Thus, at the beginning of the
(m+1)-th iteration of Step [3], i = i* denotes the index
of the latest partition constructed thus far.

To prove the two statements of the above theorem,
denote by ℓ(S) the length of the input S. For ℓ(S) = 2,
S = <μ_1 μ_2> (say). Then, by Step [2], I_1 = {μ_1}. The only
steps by which μ_2 is placed in I_1 are [4a], [6], [7] and
[11], and in all these cases, μ_1 ∥ μ_2 is satisfied.
Hence, statement (a) of Theorem 4.3 is proved. If how- 
ever, “(Uy aes U5) A v(u, A* HZ) then either Steps [4b] or 
[5] is entered, and in either of these, μ_2 is placed in
I_2. This proves statement (b).
Suppose as the induction hypothesis, that the
theorem is true for a length n-1, and consider μ_n, the
n-th MO. Without loss of generality, denote the current
set of partitions by

I = {I_1, I_2, ..., I_r}.

It is easy to see that proposition (a) holds since the
only conditions under which μ_n is placed in an existing
partition I_k are when for all μ ∈ I_k, μ ∥ μ_n (Steps [4a],
[6], [7], [9a], [11]). In the remaining cases, a new
partition is created for μ_n (Steps [4b], [5], [9b], [10]).
Thus, in the case of an existing partition I_k, all the
MO's remain pairwise parallel.
Consider the second proposition. If Hn is aseMO, 


then by Step [4], Un is placed either in I, or in Teas 


TE ae) Sy Ou eon I, then any Us in S satisfying (Wa = ey at 


wus tle u Wil eiOteDemiuel a) Hence T(u;) < Ce) 


Pe 
since I; 1s the Latest microinstruction. Df there does 


exist a SD stsy dhs Risteley Geetcke te A % 
us ; (us < uy) Cu, | | 


will be placed in I; 


L wo then oe 


41! 1.66; mo) <Tiu) since I< Tia 
by assumption. In either case then, statement (b) is 
satisfied since, if ile is a BMO and Wa < Uy in S, then 
wu A u,) Imo Lil vanoLase 

Let be be an FMO, and let the condition of Step 


[5] be satisfied. Then Un is placed in a new partition 


Tiaae 


and T(u.) <I(u,) for all Us < uw, in Se 

TE the condition of Step [5] is not Satisfied, 
lava seoren iii wy Bere WAN u, or ud* uy. If now, the 
Condition of Step [6|)is satisfied, then fom all aise Ij, 
at Hee anda (1) s If there exists any My in 
S,/ such that ~(u [ly ea ~w (us * H,) then Ua ¢ I,. Thus 


T (us) < T(un)- 
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peecthe iconditionton Step [6 \sdoesimotehold;ethen 
(u A* un) Vv Cu 6 Wy? holds {£0 mea ll wel,. Tf the condi- 


troncorestep [1] ais mow tsatvsfied, T(u,) = I and as 


ail 
above ats T(u,). Otherwise the next step, [8] is 
entered Vin whichicase.thercondition. (wy os u,,) V 
Crh ) A (SK n SK = ¢)] is true. An exit from Step 


[8] is obtained when either of the following conditions 


is satisfied: 


[a] =O) A PCr 26 uA SK n SK= d)V (yu A* WL) £OxV 


aulak. “Wy 4s I for all Ine pas nee le 


[b] in = A Gor some hi sacistvying 1 <ihn = 1*— — such 
Shae Looe HA) Vv Ge 6 WA SK n SK, Ao)J A 


[v(u heap is for some u € I,. 


Condition [a] leads to two possibilities, viz: 


[al] For all a (Oe = ae po) ana BtOtea Lumens eT, waA® Un 


tse truer, ein that case a — 0, Se that by Step 195) He is 
placed alone in T\- Moreover, since condition [a] is 
satisfied, there exists no Ba Ss in S such that 

* thataolacan aie 
~(u. [|], uj) AG, A* uy) holds, so Pp Spite 


qT) does not violate statement (b) of the Theorem. 


[a2] There exists some he Gis Sp Ss -* je suchehac ac or 
ral ipl he els u(y A* ty bute (uw 6 WH, ASK SK, = o) is true. 
By Step [8], the variable "a" points to the "earliest"
microinstruction satisfying this condition. Since a ≠ 0,


_ as i ie 
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ue is placed in I, by Step [9a], Furthermore, since 
condition [a] above is still satisfied, we see that 
proposition (b) of the theorem is not violated. 

UnGer the condition ib) vabove, «since 1.7. Ue 
Step [10] is executed. Note that this condition means 


either 


cz) Vu Ovid, Ae r* WY) for some uel or 


h? 


(ii) (u 6 WH) A (SK n SK, # 9) A VC As un). 


In the case of (i), though v(i6 Wn)s it is possible 
that eae. in which case, although oie Un is* true, SO 
hoy ete Ge Au). If neither Hou, nor uyu, are true, then 


v(u 


; ps . ee ; 
ne We and since v(u Wie this implies v(ud HL) 


So in any case, Ty is the earliest possible microins- 
BrucL One Lor, Ae 

epbupilbcwallatay, velsee UGlakjy) “stewwed= 2) n Sk, # > imply 
U(uA HA) Again, the earliest possible microinstruction 
D ahe a is The 


Thus, if there exists some ae Ly in S.such that 


(1 A (Us) AS then I(v.) < I, since otherwise 
Oe lane) (He u) a h 


the earliest possible microinstruction would have been 


some I,, * I); Le Th) = I,, Step [10] ensures that Hu, 


is eplaced. ain a Jater microinstructilon, so; that ie I(u,). 


oie T(u.) < ip then, by Step (101, aiid = Tyr, so that 


ne n 
again, ECU I(u), thereby satisfying proposition (b) 


of the theorem. Q 
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The second part of the verification is concerned 


with the minimality of the output. 


Theorem 4.4 


For any input SLM S, let I = {I_1, I_2, ..., I_r} be
the output produced by Algorithm 4.1. Then I is such that
there exists in I_j (2 ≤ j ≤ r) at least one MO which cannot
be placed in an earlier microinstruction.


Proof 


By induction on the length ℓ(S) of the input.
For ℓ(S) = 1, 2, the proof is trivial. Suppose the theorem
is true for SLM's of length n-1, and let the output be
denoted by


nS {I,,1 


The assumption of minimality means that there exists 


w= 2c eae Least Ones palr OF mtcro- 


operations tm el., a eliay such that the execution 


i <9; : > 
a ee eal d 


OL ut and oot can never take place in the same micro- 
instruction; and that I is the smallest set of micro- 
instructions satisfying all precedence requirements: 

Considering the n-th MO μ_n, we can immediately
see that the cardinality |I| of I can never be made less
than r because of the induction hypothesis. Thus the
minimum possible value of |I| is either r or r+1.

In the special case where μ_n is a BMO, then by
Theorem 4.2, |I| = r if for all μ ∈ I_r, μ ∥ μ_n; otherwise
|I| = r+1. These are the minimal possible values of |I|.
That these values are indeed obtained by Step [4] is
easily seen. Suppose μ_n is not a BMO. Then the only
steps where an additional microinstruction is created
are [5], [9b] and [10]; in all other cases μ_n is placed
in an existing microinstruction so that |I| remains r.
It is thus sufficient to show that under the conditions
leading to additional microinstructions, μ_n must be
placed in a new partition.


CL] The condition of Step [5] requires (by Theorem 


Z : 
Ae paella I. ee hence i must be placed in Tate 


p24) im Step, (9b)i,,-condition,) [all vinethesproot of 
Theorem 4.3 is satisfied; i1.e., for all I; TT ep ley oll ALe Pe 
fOr at et =n I. Hence Vee cannot be placed in an exist- 


ing smacroinstruction, SO a NeW DarteLt1on must. bercreated 


ror Moe 


[3] trmoctep.: (U0); ethos cOnd Le Onl dine Des Oo ae pe 
is satisfied then μ_n cannot be placed in I_i. If this
condition is satisfied for all I_i, then μ_n has to be placed
(as indeed it is by the algorithm) in a new partition so
that |I| = r+1. This completes the proof of the theorem.


4.3.1 An Example


To demonstrate the application of the algorithm, 


I shall use the hypothetical SLM specified below. 
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Hypothetical Input to Algorithm 4.1 with Ty < I, 


Construction of the output set of microinstructions
by the algorithm is demonstrated by the sequence
of partition sets that is progressively obtained


(UEAIGig ot Bae Ve 


In contrast, consider the application 


of the non-optimizing JD algorithm to the same example. 


The microinstruction set obtained in this case is given 


ae Gh igen ds. 


is required. 
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Construction of the Microinstruction Set for 
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the Example of Fig. 4.5 


Output of the JD Algorithm for the Input of 
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AV 4. Conclusions 


As I have remarked earlier, Algorithm 4.1 is of
somewhat greater generality than the optimizing algorithms
of Tsuchiya and Gonzalez [72] or Yau et al [77], since it
is applicable to the more problematic case of polyphase
microprograms. The proposed algorithm is then essentially
an optimizing version of the Jackson-Dasgupta
algorithm [139].

Consider now, the computational (time) complexity 
of Algorithm 4.1. Using the number of comparisons
between pairs of micro-operations as a measure of this 


complexity, we obtain the following result: 


Theorem 4.5 


Algorithm 4.1 requires O(n²) comparisons where n


is the size of the input SLM. 


Proof 


Consider the k-th MO μ_k, for 2 ≤ k ≤ n. For μ_k,
one and only one of the following step sequences will
be executed: [4]; [5]; [6]; [7]; [8], [9]; [8], [10];
or [8], [10], [11]. It is easily seen that the longest
case corresponds to the step sequence [8], [10] since
complete backtracking may be involved here.


Suppose the partitions already obtained are: 
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with the arrow indicating the "direction" of comparisons 
in Step [8]. There are k-1 MO's in these partitions.

In the worst case then, on exiting from Step [8], i = 1,
so that μ_k has already been compared with these k-1
MO's. In Step [10], the "direction" of comparisons is
reversed.

The worst case will then occur if μ_k cannot be
placed in any of the existing partitions, in which case
Step [10] causes μ_k to be compared a further (k-1) times
while backtracking.

In the worst case then, μ_k requires a total of
2(k-1) comparisons, and if this happens for each of the
MO's μ_2, μ_3, ..., μ_n (the worst possible case), the total
number of comparisons is

    Σ_{k=2}^{n} 2(k-1) = n(n-1).

Thus, the time complexity using this particular measure
is O(n²). □
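The worst-case count used in the proof, 2(k-1) comparisons for the k-th MO, sums to n(n-1) over the MO's from the second to the n-th. The short sketch below merely checks this arithmetic numerically; the function name is my own and nothing here measures the algorithm itself.

```python
def worst_case_comparisons(n):
    """Total comparisons if each MO from the 2nd to the n-th needs 2(k-1)."""
    return sum(2 * (k - 1) for k in range(2, n + 1))


# Compare the summed count with the closed form n(n-1) for a few sizes.
for n in (2, 5, 10):
    print(n, worst_case_comparisons(n), n * (n - 1))
```

The closed form grows quadratically, which is the O(n²) bound stated by the theorem.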
Complexity-wise then, Algorithm 4.1 is of the 
same order as the JD algorithm since the latter requires 
O(n²) comparisons of MO pairs in order to construct the
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CHAPTER V 


PARALLELISM IN LOOP-FREE MICROPROGRAMS 


5.1 Introduction


Consider the canonical microprogram shown in 
Fig. 5.1. If Algorithm 4.1 is applied separately to
each of the straight-line segments (demarcated here by 
dashed lines), then a total of 9 microinstructions is 
obtained (Fig. 5.2). However, one may easily observe 
that Ug can be executed along with Wy and Us and Hio 
with U3 without changing the final machine state; by 
doing so, the resulting number of microinstructions 
reduces to 7 (Fig. 5.3).

This example illustrates how a more "global" 
analysis of the input canonical microprogram may often 
yield a better output than (local) analysis of the 
straight-line components alone. 

I shall refer to the phenomenon wherein MO's not 
necessarily belonging to the same SLM are placeable in 
the same microinstruction as global parallelism. Thus 
parallelism within an SLM is a special (local) case of 
global parallelism. The aim of the present chapter is 
to develop a partial theory of global parallelism in 
microprograms, and by applying this theory, extend 


Algorithm 4.1 to the more general case of loop-free
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microprograms. 


the concept of code motion as illustrated


The basic idea behind the 
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The use of code motion transformation in program 
optimization is well known [4,5]. The usual objective 
there is to remove some invariant piece of code from 
within a loop so as to reduce the number of times that 
the code segment is executed. In the present context, 
code motion will be utilised only to enable (if possible) 
better compaction of MO's, i.e. to generate as few
microinstructions as possible.

One must note however, that analysing a microprogram
for global parallelism may often prove to be
fruitless: the resulting output may be as large as is
produced by purely local analysis. Indeed, such improvements
as demonstrated by the example of Fig. 5.1 would
have been unnecessary had the original microprogram been
manually optimized by the programmer noticing, for
instance, that μ_9, μ_10 could have been part of the first
rather than the last SLM.

Against this observation I offer the argument that 
the whole objective of automatic optimization is to permit 
the programmer to concentrate on the problem of
microprogram correctness rather than on efficiency. If the
code segment of Fig. 5.1 is "correct", the microprogrammer's
task is done. It is up to the optimizer (or more generally, 
the compiler) to transform and if possible, improve the 
code. 

Global analysis then, offers a possible strategy 
for microprogram optimization. In some cases it will 
yield better code than can be produced by purely local 
analysis (as in the case of Fig. 5.1); in other cases 
there will be no improvement, as for instance, for the 
segment shown in Fig. 5.4. The choice of using or rejecting
global analysis as a means of optimization is an
implementation decision. I should point out however,
that the algorithms presented in this chapter are such
that the output produced will certainly be no worse than
that produced by local analysis. Hence, the real price
to be paid is greater compilation/optimization time.
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This aspect will be discussed further below. 
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Manually Transformed Version of Microcode Segment 
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The general problem of detecting parallel tasks in 
branch-containing task streams has been studied previously
by other authors [44,69]. Kuck et al [44] were
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concerned with the analysis of FORTRAN-type statements 
including DO-loops. Tjaden and Flynn [69] used transi- 
tion matrices for the dynamic detection of concurrency 
in instruction streams. They pointed out that even if 
conditional branches are present in the instruction 
stream, it is still possible to identify segments which 
would always execute regardless of the branch decision. 
This particular concept forms the basis for the present 


analysis. 


5.2 Microprogram Flowgraphs and Symmetric Pairs 


A canonical microprogram S can be transformed into 
a set of SLM's together with a specification of the pre- 
cedence relationships between the SLM's, using the 
method proposed by Ramamoorthy and Gonzales [53]. More 
precisely, it is assumed that the canonical microprogram 


is in the form of a flowgraph defined as follows [2]: 


Definition 5.1


A flowgraph is a labelled, directed graph G, containing
a distinguished vertex v such that every vertex
in G is reachable from v. Vertex v is called the begin
vertex.


Definition 5.2


A flowgraph of a canonical microprogram S, is a
flowgraph G_S in which each vertex corresponds to an SLM.
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Let each vertex be labelled by the name of the SLM it 

represents. Then an edge e_ij is drawn from vertex S_i
to vertex S_j if

(i) the last MO in S_i is neither a BMO nor a HALT,
and S_j follows S_i in S; or

(ii) the last MO in S_i is a BMO, and the first MO in
S_j is either an explicit or implicit destination
of the BMO.


Figs. 5.5 and 5.6 schematize two canonical micro- 
programs. The corresponding flowgraphs are given by 
Figs. 5.7 and 5.8 respectively.
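The two edge rules of Definition 5.2 can be sketched as follows. This is an illustration under stated assumptions rather than the construction method of [53]: each SLM is described by a hypothetical dictionary giving its name, the kind of its last MO, the explicit branch destinations, and whether the branch is conditional; treating the textual successor of a conditional branch as its "implicit destination" is my reading, not something the definition spells out.

```python
def build_flowgraph(slms):
    """Return the set of edges (S_i, S_j) of the flowgraph.

    slms -- list of dicts in textual order, each with the (assumed) keys
            'name', 'last' in {'plain', 'branch', 'halt'}, and optionally
            'targets' (explicit destinations) and 'conditional' (bool).
    """
    edges = set()
    for idx, slm in enumerate(slms):
        follower = slms[idx + 1]["name"] if idx + 1 < len(slms) else None
        if slm["last"] == "plain":
            # Rule (i): neither a BMO nor a HALT, so control falls through.
            if follower is not None:
                edges.add((slm["name"], follower))
        elif slm["last"] == "branch":
            # Rule (ii): explicit destinations of the BMO.
            for target in slm.get("targets", ()):
                edges.add((slm["name"], target))
            # Rule (ii): implicit destination (assumed to be the fall-through).
            if slm.get("conditional", False) and follower is not None:
                edges.add((slm["name"], follower))
        # A HALT contributes no outgoing edge.
    return edges


example = [
    {"name": "S1", "last": "branch", "targets": ["S3"], "conditional": True},
    {"name": "S2", "last": "plain"},
    {"name": "S3", "last": "halt"},
]
print(sorted(build_flowgraph(example)))
```

Representing the flowgraph as a plain set of ordered pairs keeps the later path computations independent of any particular graph library.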

Consider the execution of the microprogram represented
by Figs. 5.5 and 5.7. Clearly, regardless of the
branch decision at Uye Sy and S3 will always be executed. 
Furthermore, they will be executed exactly once. On 
the other hand, depending on the decision at Uge S5 may 
not be executed at all, or it may be executed several 
times.  eSimi larly, 2h eld. 5.6 S¢ is executed if and only 
He Sy 


is executed. 


is executed, and Sc is executed if and only if So 


Thus, given an arbitrary flowgraph G_S, we may
identify pairs of vertices S_i, S_j satisfying the property
that S_i is executed if and only if S_j is executed. Such
vertex pairs will be called symmetric pairs. Their
significance lies in that MO's in symmetric pairs are
potential candidates for global parallelism. Note in
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Flowgraph of Fig-5.6 




Pig. o.oo that S3 and S. do not form a symmetric pair 


since the execution of S. does not imply the execution 
es S32. 


Consider a directed path

P = S_1 e_12 S_2 e_23 ... S_(k-1) e_(k-1,k) S_k

within a flowgraph, where the S_i's and e_ij's denote
respectively, the vertices and edges in the path. Then
P is said to include S_1, S_2, ..., S_k; also, P is said to
be from S_1 to S_k. If invalence(S_1) = 0 and
outvalence(S_k) = 0, the path P is said to be maximal. Intuitively,
a directed path P is maximal if it cannot be extended by
an edge at either end. Given a maximal directed path P
from S_1 to S_k, S_1 is said to be the origin and S_k the
terminus of P.

A path P_i in G_S is distinct if there exists no
other path P_j in G_S such that E(P_i) = E(P_j), where E(P)
denotes the edge set in P.

Note that a directed path in a flowgraph may
include a directed circuit as a subpath. For instance,
Fig. 5.7 contains 3 maximal directed paths of which one
includes a directed circuit:
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Furthermore, all these paths are pairwise distinct 
since E(P_1) = {e_12, e_23}, E(P_2) = {e_12, e_22, e_23}, E(P_3) =
{e_13}. Henceforth, I shall omit the word "distinct", it
being always understood when a path is being referred to.

Conditions by which symmetric pairs may be identified
are given by the following:


Theorem 5.1 


Let S_i, S_j be a pair of vertices in a flowgraph G_S.
Then S_i, S_j form a symmetric pair if the following
conditions hold:
(i) All maximal paths that include S_i also include S_j.
(ii) All maximal paths that include S_j also include S_i.
(iii) Any directed circuit that includes S_i also includes
S_j.
(iv) Any directed circuit that includes S_j also includes
S_i.

Proof 


Let S_i, S_j be such that (i)-(iv) above are satisfied.
Suppose that in executing G_S, S_i is executed. Then
exactly one of the paths that include S_i will be traversed,
and so by (i), S_j will also be executed. If S_i is not
in a directed circuit then neither is S_j, by (iv). Hence
S_i and S_j will both execute exactly once. If S_i is in a
directed circuit then so is S_j by (iii), so that if the
circuit is traversed n (≥ 1) times, both S_i and S_j execute
n times. Thus S_j is executed if (whenever) S_i is
executed.

Similarly, suppose that in executing G_S, S_j is
executed. By analogous arguments it can be seen that
S_i executes if (whenever) S_j executes. Hence S_i executes
iff S_j executes, i.e., S_i, S_j form a symmetric pair. □
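For a loop-free flowgraph, conditions (iii) and (iv) hold vacuously, so candidate symmetric pairs can be found by enumerating the maximal paths and testing conditions (i) and (ii). The sketch below does exactly that and nothing more: it assumes the graph is acyclic (the recursion would not terminate otherwise), and the function names and the diamond-shaped example graph are mine, not taken from the figures.

```python
from itertools import combinations


def maximal_paths(edges, vertices):
    """All directed paths from an invalence-0 vertex to an outvalence-0 vertex."""
    successors = {v: [] for v in vertices}
    invalence = {v: 0 for v in vertices}
    for a, b in edges:
        successors[a].append(b)
        invalence[b] += 1
    paths = []

    def extend(path):
        last = path[-1]
        if not successors[last]:          # cannot be extended: path is maximal
            paths.append(tuple(path))
            return
        for nxt in successors[last]:
            extend(path + [nxt])

    for v in vertices:
        if invalence[v] == 0:             # maximal paths start at a source
            extend([v])
    return paths


def symmetric_pairs(edges, vertices):
    """Pairs lying on exactly the same maximal paths (conditions (i) and (ii))."""
    path_sets = [set(p) for p in maximal_paths(edges, vertices)]
    return [
        (a, b)
        for a, b in combinations(vertices, 2)
        if all((a in p) == (b in p) for p in path_sets)
    ]


diamond_vertices = ["S1", "S2", "S3", "S4"]
diamond_edges = [("S1", "S2"), ("S1", "S3"), ("S2", "S4"), ("S3", "S4")]
print(symmetric_pairs(diamond_edges, diamond_vertices))
```

The number of maximal paths can grow exponentially with the number of branch points, so this enumeration is meant only to make the conditions concrete.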


Theorem 5.2 


If S_i, S_j form a symmetric pair then
(i) All maximal paths that include S_i also include S_j.
(ii) All maximal paths that include S_j also include S_i.


Proof
Let S_i and S_j be a symmetric pair. Then by definition

Execution of S_i implies execution of S_j    (I)
Execution of S_j implies execution of S_i    (II)

Now if condition (i) is not satisfied then there exists
at least one path, say P_k, which includes S_i but not S_j.
If in executing G_S, P_k is traversed, then S_i will execute
but not S_j, contradicting (I). Similarly if condition
(ii) is not satisfied then there exists at least one path,
say P_l, which includes S_j but not S_i. If in executing G_S,
P_l is traversed, S_j will execute but not S_i, contradicting
(II). □

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Finally, as a special case, the following corollary 


is obtained from Theorem 5.1. 


Corollary 5.1


Vertices S_i, S_j in a flowgraph form a symmetric
pair if S_i is a source vertex and S_j a sink vertex in G_S,
and there exist no other source or sink vertices in G_S.


Proof


If S_i and S_j are the unique source and sink vertices
respectively then all maximal paths originate at S_i
and terminate at S_j, i.e. all maximal paths in G_S include
both S_i and S_j, thus satisfying conditions (i) and (ii) of
Theorem 5.1. Finally since invalence(S_i) = outvalence(S_j)
= 0, S_i and S_j are both excluded from any directed
circuit in G_S. □
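The corollary gives a cheap test that needs no path enumeration: count the vertices of invalence 0 and of outvalence 0. A minimal sketch follows, assuming the flowgraph is given as a vertex list and a list of edge pairs (a representation chosen here for illustration only).

```python
def unique_source_and_sink(edges, vertices):
    """Return (source, sink) if each is unique, otherwise None."""
    invalence = {v: 0 for v in vertices}
    outvalence = {v: 0 for v in vertices}
    for a, b in edges:
        outvalence[a] += 1
        invalence[b] += 1
    sources = [v for v in vertices if invalence[v] == 0]
    sinks = [v for v in vertices if outvalence[v] == 0]
    if len(sources) == 1 and len(sinks) == 1:
        return sources[0], sinks[0]
    return None


vertices = ["S1", "S2", "S3", "S4"]
edges = [("S1", "S2"), ("S1", "S3"), ("S2", "S4"), ("S3", "S4")]
print(unique_source_and_sink(edges, vertices))
```

When the function returns a pair, the corollary states that the two vertices form a symmetric pair; when it returns None, the corollary simply does not apply.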


5.3 Conditions for Global Parallelism 


As stated earlier, symmetric vertices serve as 
potential candidates for the identification of globally 
parallel MO's. Thus the first step in global analysis
is the detection of symmetric pairs. The problem of
identifying all symmetric pairs in an arbitrary flowgraph

(1) A source vertex in a directed graph is a vertex of
invalence 0. A sink vertex is a vertex of outvalence 0.

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involves establishing all possible maximal paths from the 
source to all sinks, followed by a search for vertex pairs 
satisfying the conditions of Theorem 5.1. However, even 
if such symmetric pairs are identified, this may not in 
fact, lead to a smaller set of. microinstructions (with 
respect to a local analysis of the flowgraph).~ “This point 
will be further explained below. 

Consider a symmetric pair (S_i, S_j) in a flowgraph
G_s. Let

    P_ij = {P_1, P_2, ..., P_k}                    (5.3)

be the set of all paths from S_i to S_j, and call this set
the path set from S_i to S_j. Define the internal vertex
set V_ij corresponding to P_ij as the set of distinct
vertices included in the paths P_1, ..., P_k, excluding S_i
and S_j.
For example, consider the symmetric pair S_1 and
S_6 in the flowgraph of Fig. 5.8. The corresponding path
set is then

    P_16 = {P_1, P_2, P_3, P_4}                    (5.4)

where
Be Siar 
pee =) Se. po eer. 
2 Tos 545626 (08) 
    P_3 = S_1 e_12 S_2 e_23 S_3 e_35 S_5 e_56 S_6

    P_4 = S_1 e_12 S_2 e_23 S_3 e_34 S_4 e_45 S_5 e_56 S_6
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The internal vertex set corresponding to P is clearly: 


    V_16 = {S_2, S_3, S_4, S_5}                    (5.6)
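
The path set and internal vertex set defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flowgraph is taken to be acyclic and is given as a dictionary of successor lists, the names `path_set` and `internal_vertex_set` are hypothetical, and the six-vertex graph `ADJ` is an invented example, not a transcription of Fig. 5.8.

```python
def path_set(adj, src, dst):
    """Return every directed path from src to dst (graph assumed acyclic)."""
    if src == dst:
        return [[dst]]
    paths = []
    for nxt in adj.get(src, []):
        for tail in path_set(adj, nxt, dst):
            paths.append([src] + tail)
    return paths


def internal_vertex_set(adj, src, dst):
    """Distinct vertices on the src-to-dst paths, excluding both endpoints."""
    internal = set()
    for path in path_set(adj, src, dst):
        internal.update(path[1:-1])
    return internal


# Hypothetical flowgraph, used only to exercise the two functions.
ADJ = {1: [2], 2: [3, 5], 3: [4, 5], 4: [5], 5: [6], 6: []}

if __name__ == "__main__":
    print(path_set(ADJ, 1, 6))
    print(internal_vertex_set(ADJ, 1, 6))
    print(internal_vertex_set(ADJ, 2, 5))
```

In this invented graph the inner pair (2, 5) has a smaller internal vertex set than the outer pair (1, 6), which is the situation the discussion of symmetric pairs in this section is concerned with.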


Definition 5.3 


Let (S_i, S_j) be a symmetric pair in a flowgraph G_s,
and V_ij the internal vertex set corresponding to the path
set P_ij. Then μ_p in S_i and μ_q in S_j are global
candidates if the execution of the sequences μ_p S μ_q and
μ_p μ_q S are state equivalent (2) for all S ∈ V_ij.


Theorem 5.3 


Let (S_i, S_j) be a symmetric pair, and V_ij the
internal vertex set corresponding to the path set P_ij. Then
μ_p in S_i and μ_q in S_j are global candidates if μ_q β μ_k
for all μ_k in S, for all S ∈ V_ij.
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Proof 
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J 
Assume that for all μ_k in S, for all S ∈ V_ij,
μ_q β μ_k. Then the execution of μ_q has no effect on the
states of the data sources and sinks of any MO appearing
in V_ij. Hence for all S ∈ V_ij, μ_p S μ_q and μ_p μ_q S are
state equivalent and so μ_p and μ_q are global candidates. □


(2) A pair of sequences of MO's, say S_1 and S_2, are said to
be state equivalent if for all initial machine states
they produce the same final machine state.
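
The sufficient condition of Theorem 5.3 amounts to a pairwise data-independence test between one MO and every MO of every internal SLM. A minimal Python sketch follows. It rests on assumptions that are not taken from this section: each MO is modelled only by its sets of data sources and data sinks, and the relation β is read as the usual "no shared written resource, and nothing written by one is read by the other" test. The names `MO`, `data_independent` and `are_global_candidates` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MO:
    """A micro-operation reduced to the resources it reads and writes."""
    name: str
    sources: frozenset
    sinks: frozenset


def data_independent(a, b):
    # Assumed reading of the relation beta: neither MO writes a resource
    # that the other one reads or writes.
    return not (a.sinks & (b.sources | b.sinks)) and not (b.sinks & a.sources)


def are_global_candidates(mu_q, internal_slms):
    """Theorem 5.3 style check: mu_q is independent of every internal MO."""
    return all(data_independent(mu_q, mu)
               for slm in internal_slms
               for mu in slm)


if __name__ == "__main__":
    mu_q = MO("q", frozenset({"R1"}), frozenset({"R2"}))
    harmless = [MO("a", frozenset({"R3"}), frozenset({"R4"}))]
    clashing = [MO("b", frozenset({"R2"}), frozenset({"R5"}))]
    print(are_global_candidates(mu_q, [harmless]))
    print(are_global_candidates(mu_q, [harmless, clashing]))
```

The cost of the check grows with the total number of MO's in the internal vertex set, since every one of them is compared against `mu_q`.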
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It should now be clear that from a pragmatic view- 
point, the identification of all possible symmetric pairs 
in an arbitrary flowgraph may not be justified. 

For suppose we establish that a particular vertex
pair S_i, S_j are symmetric, and we also determine the
corresponding internal vertex set V_ij. Let the length of the
i-th SLM be l_i. Then, if |V_ij| = k, the number of MO's
contained in V_ij is Σ l_i, the sum being taken over the k
SLM's in V_ij. To establish whether μ_q in S_j is a global
candidate with some μ_p in S_i, μ_q must be compared with
each one of the Σ l_i MO's in V_ij.

It seems reasonable to hypothesize that the probability
of μ_q being data independent of all MO's in V_ij
decreases as Σ l_i increases. Hence if |V_ij| is too
large the probability of obtaining a pair of global
candidates is likely to be very small. In such a situation,
the computational work expended in identifying S_i, S_j as a
symmetric pair is most likely to be wasted.

As an example, consider Fig. 5.8. Here (S_1,S_6)
are symmetric, as are (S_2,S_5). Since V_25 ⊂ V_16,
|V_25| < |V_16|, so the probability of identifying global
candidates between (S_2,S_5) is expected to be higher than
between (S_1,S_6).
The proposed solution to the above problem is a
heuristic one. Symmetric pairs are identified in loop-free
microprograms only; furthermore, the identification
of global candidates is attempted only between those
symmetric pairs (S_i, S_j) with internal vertex set V_ij
where |V_ij| ≤ 2. On this basis (S_1,S_6) of Fig. 5.8
would not be examined for the presence of global
candidates while (S_2,S_5) would. Note that by restricting
global analysis to loop-free microprograms, symmetric
pairs are identifiable on the basis of Theorem 5.1,
conditions (i) and (ii) only. The computational complexity
of the parallelism-detection procedure is thereby
greatly reduced.

The present section is completed with the following
definition.

Definition 5.4

Let (S_i, S_j) be a symmetric pair in a flowgraph G_s
and let μ_p in S_i and μ_q in S_j be global candidates.
Then μ_p, μ_q are said to be globally parallel, denoted
μ_p ||_g μ_q, if for all initial machine states, the execution
of a microinstruction I = {μ_p, μ_q} is state-equivalent to
the execution of the microinstruction sequence I_1 = {μ_p},
I_2 = {μ_q}.

In other words I am distinguishing between a pair
of MO's, say μ_p, μ_q, being globally or locally parallel
according to whether, in the original flowgraph, μ_p, μ_q
belonged to separate (symmetric) vertices or to the same
vertex respectively.
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Clearly, once Wyre have been identified as global 
candidates, the conditions for μ_p ||_g μ_q are identical to
those for local parallelism. That is


This is because, since μ_p and μ_q are global candidates,
μ_q can precede all MO's in the internal vertex set V_ij.
We can thus construct a "new" SLM S_i μ_q from which (5.7)
follows.


5.4 Identification of Symmetric Pairs in Reduced Flowgraphs 


Given a directed graph G=(V,E), v_i, v_j ∈ V are
strongly connected if and only if there exists a path
from v_i to v_j and a path from v_j to v_i in G. A strongly
connected subgraph G' = (V',E') of G is a subgraph such
that all pairs v_i, v_j ∈ V' are strongly connected. A strong
component is a maximal strongly connected subgraph.

Given a flowgraph, its strong components can be
determined by any one of a number of efficient algorithms
[50,67]. A reduced flowgraph G_s^R is obtained from the
original flowgraph G_s by replacing each of its strong
components by a single supervertex. The important
characteristic of the reduced flowgraph is that it is acyclic.
One should also note that the supervertices represent
several SLM's.
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AS an example, consider the flowgraph of Fig. 5.9. 
Its strong components are represented by the two sub- 
graphs containing vertices {S5,S3}, and {84 378y478,5} 
respectively. In the corresponding reduced flowgraph
(Fig. 5.10), these components are denoted by (super)
vertices S_2' and S_13'.

Thus if a given flowgraph contains strong components,
it is first transformed to a reduced form. Furthermore,
if the flowgraph contains n > 1 sinks, it is further
transformed into a single sink graph by simply adding
a dummy vertex with edges from all the n sinks to the
dummy vertex. It is assumed that if such a transformation
is made, the dummy vertex is suitably identified.
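
The two transformations just described (collapsing strong components, then adding a dummy sink) can be sketched as follows. This is a hedged illustration, not the procedure of the cited algorithms [50,67]: it uses Kosaraju's two-pass depth-first search as a stand-in, represents each supervertex by one member of its component, and the names `strong_components`, `reduce_flowgraph` and the dummy label `"END"` are hypothetical. Every vertex is assumed to appear as a key of the adjacency dictionary.

```python
def strong_components(adj):
    """Kosaraju's two-pass DFS; maps each vertex to a representative."""
    order, seen = [], set()

    def visit(v):
        seen.add(v)
        for w in adj[v]:
            if w not in seen:
                visit(w)
        order.append(v)          # record v once all successors are done

    for v in adj:
        if v not in seen:
            visit(v)

    radj = {v: [] for v in adj}  # reversed graph
    for v in adj:
        for w in adj[v]:
            radj[w].append(v)

    comp = {}

    def assign(v, root):
        comp[v] = root
        for w in radj[v]:
            if w not in comp:
                assign(w, root)

    for v in reversed(order):
        if v not in comp:
            assign(v, v)
    return comp


def reduce_flowgraph(adj, dummy="END"):
    """Collapse strong components, then add a dummy sink if needed."""
    comp = strong_components(adj)
    reduced = {c: [] for c in set(comp.values())}
    for v in adj:
        for w in adj[v]:
            cv, cw = comp[v], comp[w]
            if cv != cw and cw not in reduced[cv]:
                reduced[cv].append(cw)
    sinks = [v for v, succ in reduced.items() if not succ]
    if len(sinks) > 1:
        reduced[dummy] = []
        for s in sinks:
            reduced[s].append(dummy)
    return reduced


# Hypothetical flowgraph: vertices 2 and 3 form a loop; 4 and 5 are sinks.
ADJ = {1: [2], 2: [3], 3: [2, 4, 5], 4: [], 5: []}

if __name__ == "__main__":
    print(reduce_flowgraph(ADJ))
```

The result is acyclic by construction and has exactly one sink, which is what the maximal-path procedure that follows relies on.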

The first step of the procedure identifies all
maximal paths in G_s^R. This is done as follows:

Denote the (unique) source and sink vertices in
G_s^R by B (for "begin") and E (for "end") respectively.
Then a rooted (or directed) tree (call it the maximal
path (MP) tree T_s) is constructed such that any path from
the root to a terminating vertex (leaf) of T_s identifies
a maximal path in G_s^R.

More precisely, an MP tree T_s corresponding to G_s^R
is a tree such that
(a) T_s is rooted at a vertex which corresponds to B in
G_s^R; call this root B^t.
(b) For a vertex S_i^t in T_s there exists an offspring
S_j^t in T_s if and only if there is an edge from S_i
to S_j in G_s^R.

For convenience of reference, if a vertex in T_s
corresponds to a vertex S_i in G_s^R, the former is labelled
S_i^t. Note that since T_s is a tree, if there are two
vertices S_i, S_j (say) in G_s^R such that edges lead from both
S_i and S_j to S_k, then S_k^t appears as a distinct offspring
for both S_i^t and S_j^t.

[Fig. 5.9: A Flowgraph G_s. Fig. 5.10: Reduced Flowgraph G_s^R
corresponding to G_s.]


Lemma 5.1 


Let T_s be an MP tree corresponding to G_s^R. Then
(a) the leaves of T_s correspond to the sink E of G_s^R,
(b) there exist exactly as many paths from the root
to the leaves in T_s as there are maximal paths in
G_s^R.

Proof
(a) A leaf, say S_i^t, has no offspring. Hence from the
definition of MP tree, S_i in G_s^R is of outvalence 0. Since
there is only one sink in G_s^R, every leaf in T_s
corresponds to the sink E of G_s^R.
(b) Since a tree is connected there is a path in T_s
from the root B^t to every leaf in T_s. Let one such path
be

    B^t, S_1^t, S_2^t, ..., S_(n-1)^t, E^t

Then S_1^t is an offspring of B^t, S_2^t is an offspring of
S_1^t, ..., E^t is an offspring of S_(n-1)^t, implying that there
exists an edge from B to S_1, from S_1 to S_2, ..., from
S_(n-1) to E in G_s^R; hence the directed path

    P_i = B e_1 S_1 e_2 S_2 ... S_(n-1) e_n E

exists in G_s^R. Since P_i originates at B and terminates
at E, it is maximal. Thus if |P| denotes the number
of maximal paths in G_s^R and |P^t| the number of paths
in T_s from the root to the leaves, then from the above

    |P| ≥ |P^t|                                    (5.8)

Similarly let

    P_j = B e_1 S_1 e_2 S_2 ... S_(n-1) e_n E

be a maximal path in G_s^R. Then there exist directed
edges from B to S_1, from S_1 to S_2, ..., from S_(n-1) to E
in G_s^R; hence in T_s, E^t is an offspring of S_(n-1)^t, ...,
S_2^t is an offspring of S_1^t, and S_1^t is an offspring of B^t.
So a path in T_s exists from B^t to E^t, hence

    |P^t| ≥ |P|                                    (5.9)

From (5.8) and (5.9) it follows that |P| = |P^t|. □


The maximal paths can be determined using a modi- 


fied version of the depth-first search algorithm [3]. 
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Algorithm 5.1 


Construction of an MP tree T_s corresponding to a
reduced flowgraph G_s^R.

Input

G_s^R = (V,E) represented by adjacency lists ADJ[S]
for all S ∈ V. Vertex S_k ∈ ADJ[S] iff (S,S_k) ∈ E. T_s is
initially empty; furthermore all vertices in the adjacency
lists are initially marked "NEW". The operation "SON(δ,w)"
means "create an offspring δ of vertex w in the tree".

begin
    [1] T_s ← {B};
    [2] SEARCH(B);
    [3] STOP
end

procedure SEARCH(θ)
begin
    [4] for each NEW vertex w ∈ ADJ[θ] do
        begin
    [5]     SON(w,θ);
    [6]     Mark w OLD;
    [7]     SEARCH(w)
        end
    [8] for each vertex w in ADJ[θ] do mark w NEW
end
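
The tree construction can be sketched in Python as below. This is a simplified illustration rather than a transcription of Algorithm 5.1: the NEW/OLD marks are replaced by a plain loop over each adjacency list (which visits every entry once per call, the effect the marks achieve), the reduced flowgraph is assumed acyclic so the recursion terminates, and the names `TreeNode`, `build_mp_tree`, `maximal_paths` and the example graph are hypothetical.

```python
class TreeNode:
    """One vertex of the MP tree; label is the flowgraph vertex it copies."""

    def __init__(self, label):
        self.label = label
        self.children = []


def build_mp_tree(adj, begin):
    """Unfold an acyclic graph from `begin` into its maximal-path tree."""
    root = TreeNode(begin)

    def search(node):
        for w in adj.get(node.label, []):
            child = TreeNode(w)      # plays the role of SON(w, node)
            node.children.append(child)
            search(child)

    search(root)
    return root


def maximal_paths(node):
    """List the root-to-leaf label sequences of the tree."""
    if not node.children:
        return [[node.label]]
    return [[node.label] + tail
            for child in node.children
            for tail in maximal_paths(child)]


# Hypothetical reduced flowgraph with source "B" and sink "E".
ADJ = {"B": [1, 2], 1: [3], 2: [3], 3: ["E"], "E": []}

if __name__ == "__main__":
    for path in maximal_paths(build_mp_tree(ADJ, "B")):
        print(path)
```

Vertex 3 is reached along two different paths and therefore appears twice in the tree, once under 1 and once under 2, matching the remark that a shared successor becomes a distinct offspring of each predecessor.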
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AS an example, consider the reduced flowgraph of 
Fig. 5.10. The MP tree produced by Algorithm 5.1 is
shown in Fig. 5.11.

Verification of this algorithm proceeds by induction
on n, the number of vertices in the reduced flowgraph.
For the purpose of verification assume without
loss of generality that vertices are labelled by integers
1,2,...,n, where 1 is the source vertex and n the sink.
A further assumption is that the outvalence of all
vertices in G_s^R cannot exceed 2. That is, all branches
in the original microprogram are two-way branches.

For n = 1, T_s is obviously correctly constructed.
Assume as the induction hypothesis that for all reduced
flowgraphs with k-1 vertices, Algorithm 5.1 constructs
an MP tree rooted at vertex 1. Consider now a k vertex
flowgraph. For the subgraph containing vertices
2,3,...,k, an MP tree is correctly constructed. For
the k vertex flowgraph, Step [2] causes SEARCH(1) to be
called, and SEARCH is entered. ADJ[1] will contain
vertex 2 and possibly some other vertex i (3 ≤ i ≤ k).
Suppose Step [4] first selects w = 2. Then Step [5]
creates the edge (1,2) in T_s, vertex 2 in ADJ[1] is
marked OLD, and SEARCH(2) is called. By the induction
hypothesis this call creates an MP tree rooted at vertex
2. Since 1 is connected to 2, on returning from this
call, T_s is as shown in Fig. 5.12(a).


[Fig. 5.11: MP Tree T_s for G_s^R of Fig. 5.10]

[Fig. 5.12 (a), (b)]
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Step [4] is re-entered; if ADJ [1] contains no 
other NEW vertex, no other offspring of 1 can exist. 
Steps [5]-[7] are bypassed, Step [8] is executed, the 
call SEARCH (1) is completed, and the algorithm 
terminates. An MP tree rooted at vertex 1 is thus 
correctly constructed.

If ADJ[1] contains a NEW vertex i, then edge
(1,i) is created in T_s (Step [5]), i is marked OLD and
SEARCH(i) entered. By the induction hypothesis this
call constructs an MP tree rooted at vertex i (Fig.
5.12(b)). On completing SEARCH(i), since ADJ[1]
contains no other NEW vertex no further offspring can
exist for 1, hence the tree T_s rooted at 1 is indeed
an MP tree.


Consider the MP tree T_s shown in Fig. 5.11. Each
maximal path can be explicitly described by tracing
a path from a leaf to the root, and reversing the
resulting sequence of vertices. From Fig. 5.11 for example,
this would yield the following paths (described in terms
of the vertices only):
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However, explicit specification of these paths is not 
necessary as I shall show below. 

Having constructed the MP tree, the next stage
is to identify appropriate symmetric pairs and the
corresponding internal vertex sets. To do this assume
that the k maximal paths (or equivalently, the k leaves
in the MP tree) are assigned path numbers 1,...,k in
any arbitrary manner. A path assigned the number j can
then be simply referred to as "path j". Thus, for the
above example the path numbers can simply follow the
subscripts assigned to the P's in (5.10).

Symmetric pairs can now be easily identified from
the MP tree. For, as Theorem 5.1 states, a pair of
vertices are symmetric if they are included in exactly the
same set of paths. This means that a pair of vertices
are symmetric if the vectors of the path numbers that
include them are identical. For example, consider the
vertex pair <1,2'> in Fig. 5.10. The vectors of their
path numbers are, from (5.10), both [1,2,...,8],
whereas that of vertex 5 is [1,3,5,7]. Thus <1,2'> form
a symmetric pair while <1,5> or <2',5> do not.

Instead of actually matching vectors, the symmetric
pairs can be more simply identified by assigning
a weight of 2^(j-1) to a vertex i if i is included in path j.
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Thus if the sum of the weights assigned to vertex i 
equals the sum of the weights assigned to vertex k, 
then they are included in precisely the same set of 
paths, and are therefore symmetric. The weights may 


be assigned according to the following: 


Algorithm 5.2

Assignment of weights to vertices of a reduced
flowgraph.

For j = 1 step 1 until k do
begin
    trace path j from leaf to root;
    if path j includes vertex i then WEIGHT[i] ←
        WEIGHT[i] + 2^(j-1);
end

The algorithm produces values WEIGHT[1],...,
WEIGHT[N] where 1,...,N are the vertices of the original
reduced flowgraph G_s^R. For example, the weights assigned
to the vertices of Fig. 5.10 are shown in Fig. 5.13.
These weights can now be attached to the vertices in G_s^R.


Theorem 5.4 
Let weights be assigned to the vertices of the 
reduced flowgraph according to Algorithm 5.2. Then a 


pair of vertices have identical weights iff they form a 


symmetric pair. 
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Proof 


Let there be k maximal paths in the MP tree and
let these be numbered 1,2,...,k in any arbitrary manner.
Then the possible weights for a vertex range from 1 to
2^k - 1, and the weight uniquely identifies the subset of
paths that include it. Thus if two vertices S_i, S_j have the
same weight they are included in exactly the same set of
paths and are therefore symmetric. Conversely, if S_i, S_j are
symmetric they must be included in exactly the same set
of paths, say p,q,...,r, and so the weights assigned to
both are 2^(p-1) + 2^(q-1) + ... + 2^(r-1) and are therefore
identical. □
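
The weighting scheme and the grouping it enables can be sketched as follows. This is an illustration under stated assumptions: the maximal paths are supplied as lists of vertex labels (for instance as read off an MP tree), each vertex occurs at most once per path because the reduced flowgraph is acyclic, and the names `assign_weights`, `symmetric_sets` and the two example paths are hypothetical.

```python
from collections import defaultdict


def assign_weights(paths):
    """Give vertex i the sum of 2**(j-1) over every path j that includes it."""
    weight = defaultdict(int)
    for j, path in enumerate(paths, start=1):
        for vertex in path:
            weight[vertex] += 2 ** (j - 1)
    return dict(weight)


def symmetric_sets(weight):
    """Group vertices of equal weight; singletons are not reported."""
    groups = defaultdict(list)
    for vertex, w in weight.items():
        groups[w].append(vertex)
    return [members for members in groups.values() if len(members) > 1]


# Hypothetical maximal paths of a small reduced flowgraph.
PATHS = [["B", 1, 3, "E"], ["B", 2, 3, "E"]]

if __name__ == "__main__":
    weights = assign_weights(PATHS)
    print(weights)
    print(symmetric_sets(weights))
```

Because the weight is a bit vector over path numbers, equal integers mean equal path sets, so comparing weights replaces comparing the vectors themselves.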

Figure 5.14 shows the reduced flowgraph of 
Fig. 5.10 with the weights indicated in curly brackets. 


I shall call such a flowgraph, a weighted reduced flow- 


graph ce 


5.5 Identification of Effective Symmetric Pairs 


Within a weighted reduced flowgraph, there may be 
two or more vertices with identical weights (e.g., the 
eet niet 4) oy onl 2 1S ano maoG.. 5, 4) een general, lec 


S = {S_1, S_2, ..., S_v} be a set of such weight equivalent
vertices. We may refer to S as a symmetric set since
members of S are pairwise symmetric; the problem arises
as to how appropriate symmetric pairs may be selected
from S.
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Weighted Reduced Flowgraph, ey 


The most obvious - and most expensive - method
is to examine systematically the pairs <S_1,S_2>, <S_1,S_3>,
..., <S_(v-1),S_v> for parallel MO's. But such a method
would be excessively expensive. Instead the following
heuristics are used.


[H1] If S_i is a vertex in G_s^R corresponding to a directed
circuit or a strong component in G_s, then S_i is ignored
in the identification of globally parallel MO's.

[H2] Let S_i and S_j be a symmetric pair as determined
according to Theorem 5.4, and suppose both S_i and S_j
are SLM's. Then if each path from S_i to S_j contains at
most one internal vertex, S_i, S_j are identified as an
effective symmetric pair.

[H3] A vertex S_i can be a member of at most one
effective symmetric pair.

[H4] Identification of globally parallel MO's is
restricted to effective symmetric pairs.


[H1] is merely a reminder that our analysis is
restricted to loop-free microprograms (i.e. acyclic
flowgraphs), hence any vertex in the reduced flowgraph that
represents a strong component has to be ignored.

[H2] is based upon the discussion presented in
Section 5.3 and restricts identification of effective
symmetric pairs to those symmetric pairs (S_i,S_j) where
there exists at most one internal vertex in each of the
paths from S_i to S_j. By this heuristic, vertices <1,4>
in Fig. 5.15 are identified as effective symmetric while
<1,10> in Fig. 5.16 are not.

[Figures omitted; surviving caption text: "... are Non-effective";
"Possible Connections between a Symmetric Pair"]

Such an identification is not the ad-hoc choice
that it seems. For, notice in Fig. 5.16 that <8,9>
themselves constitute a symmetric pair, and are
furthermore effective symmetric. In fact if maximal SLM's are
identified while constructing the original flowgraph,
vertices 8 and 9 would have constituted a single SLM.

[H3] ensures that pairs of effective symmetric
vertices are disjoint; finally, identification of globally
parallel MO's is restricted by [H4] to effective symmetric
pairs only.

The following algorithm identifies effective 


symmetric pairs in a reduced flowgraph. 


Algorithm 5.3

Identification of Effective Symmetric Pairs and
their Internal Vertex Sets.

Inputs

(1) The adjacency lists for all vertices in the reduced
flowgraph. ADJ[V] is the adjacency list for vertex V.

(2) The sets of symmetric vertices in the reduced
flowgraph. Let there be L such sets numbered 1,2,...,L.
Each set is ordered such that I[j] refers to the j-th
member (vertex) of the I-th set. Furthermore, if the I-th
set contains k_I symmetric vertices, then a symbol (*)
distinct from all other symbols denoting vertices is used
as the (k_I + 1)-th element of I to indicate the end of the
I-th set. By convention, let I[k_I + 2] = * for all I,
1 ≤ I ≤ L.

The symmetric sets are easily obtained from the
outputs WEIGHT[1],...,WEIGHT[N] of Algorithm 5.2.

The variable LIST is used to access each symmetric
set one at a time. INT is used to contain the
internal vertex set for an effective symmetric pair.
TEMP holds a vertex symbol temporarily. Initially, all
elements in ADJ[V] for all vertices V in the reduced
flowgraph are marked NEW.


[1]  For LIST ← 1 step 1 until L do
     begin
[2]    INT ← ∅;
[3]    i ← 1; j ← 2;
[4]    If (LIST[i] ≠ *) ∧ (LIST[j] ≠ *) then
       begin
[5]      If LIST[i] is a strong component then
         begin
[6]        i ← j; j ← j+1;
[7]        goto [4]
         end
[8]      If LIST[j] is a strong component then
         begin
[9]        j ← j+1; goto [4]
         end
[10]     If ADJ[LIST[i]] ≠ ∅ then
         begin
[11]       If ∃ some V ∈ ADJ[LIST[i]] ∋ V is NEW then
           begin
[12]         TEMP ← V; Mark V OLD;
[13]         If TEMP = LIST[j] then goto [11];
[14]         INT ← INT ∪ {TEMP};
[15]         If ∃ some W ∈ ADJ[TEMP] ∋ W = LIST[j]
               then goto [11];
[16]         Make all V ∈ ADJ[LIST[i]] NEW;
[17]         INT ← ∅;
[18]         i ← j; j ← j+1;
[19]         goto [4];
           end
[20]       Output LIST[i], LIST[j], INT;
           Make all V ∈ ADJ[LIST[i]] NEW;
         end
[21]     i ← j+1; j ← j+2; INT ← ∅; goto [4]
       end
     end
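
The scan performed on one symmetric set can be sketched in Python as below. This is a simplified reading, not a transcription: strong components are filtered out up front instead of being skipped by index arithmetic, the NEW/OLD marks are replaced by a plain loop over the adjacency list, and the [H2] test is exactly the one-step check the verification below walks through (every successor of the first candidate is either the second candidate or has it among its own successors). The names `internal_if_effective`, `effective_pairs` and the example graphs are hypothetical.

```python
def internal_if_effective(adj, s_i, s_j):
    """Return the internal vertex set if the [H2] check passes, else None."""
    internal = set()
    for v in adj.get(s_i, []):
        if v == s_j:
            continue                      # direct edge, no internal vertex
        if s_j not in adj.get(v, []):
            return None                   # this path needs more than one hop
        internal.add(v)
    return internal


def effective_pairs(adj, sym_set, strong=frozenset()):
    """Scan one ordered symmetric set; return [(s_i, s_j, internal)]."""
    members = [v for v in sym_set if v not in strong]    # [H1]
    pairs, i = [], 0
    while i + 1 < len(members):
        s_i, s_j = members[i], members[i + 1]
        if not adj.get(s_i):
            i += 2                        # empty adjacency: skip both
            continue
        internal = internal_if_effective(adj, s_i, s_j)
        if internal is None:
            i += 1                        # s_j becomes the first candidate
        else:
            pairs.append((s_i, s_j, internal))
            i += 2                        # [H3]: both vertices are used up
    return pairs


# Hypothetical reduced flowgraphs used only to exercise the scan.
DIAMOND = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: [6], 6: []}
CHAIN = {1: [2], 2: [3], 3: [4], 4: []}

if __name__ == "__main__":
    print(effective_pairs(DIAMOND, [1, 4, 5, 6]))
    print(effective_pairs(CHAIN, [1, 4]))
```

Advancing by two positions after a success is what keeps the reported pairs disjoint, while advancing by one after a failure lets the rejected second candidate be tried as a first candidate.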


To verify the correctness of this algorithm, we
must show that [H1]-[H3] are satisfied. Furthermore,
for each effective symmetric pair identified, the
internal vertex set must also be identified.

Firstly, note that since the outermost loop (starting
from Step [2]) is entered once for each symmetric set,
and in the case of the K-th symmetric set no other
symmetric set is referenced, all L symmetric sets are
examined independently. It is therefore sufficient to
consider the K-th symmetric set, i.e., when LIST = K
(1 ≤ K ≤ L). The algorithm will be verified by induction
on the cardinality of the K-th symmetric set, denoted by
|K|.


For |K| = 2, the two candidate vertices are LIST[1]
and LIST[2], and the following possibilities must be
considered:

Case I:    LIST[1] is a strong component.
Case II:   LIST[2] is a strong component but not LIST[1].

For the remaining cases below, neither LIST[1] nor
LIST[2] are strong components; V, V_1, V_2 are some vertices
in the reduced flowgraph other than LIST[1] and LIST[2].

Case III:  ADJ[LIST[1]] = {LIST[2], V}    (Fig. 5.17(a))
Case IV:   ADJ[LIST[1]] = {LIST[2]}       (Fig. 5.17(b))
Case V:    ADJ[LIST[1]] = {V, LIST[2]}    (Fig. 5.17(a))
Case VI:   ADJ[LIST[1]] = ∅
Case VII:  ADJ[LIST[1]] = {V}             (Fig. 5.17(c))
Case VIII: ADJ[LIST[1]] = {V_1, V_2}      (Fig. 5.17(d))


Case I: By Steps [5]-[7], the candidate vertices
become LIST[2] and LIST[3]. On re-executing Step [4],
since LIST[3] = *, the algorithm terminates producing
(correctly) no effective symmetric pair. [H1] is thus
satisfied.

Case II: Steps [6], [7] are bypassed and Steps [8]
and [9] are executed. The candidate vertices become
LIST[1] and LIST[3], and again, since LIST[3] = *, the
algorithm terminates correctly without producing an
effective symmetric pair. Hence [H1] is satisfied.

Case III: Step [11] is entered, and since the
condition in this step is satisfied, the block beginning
at Step [12] is entered. Step [12] places LIST[2] in
TEMP and marks it as OLD in ADJ[LIST[1]]; since the
equality of Step [13] is also satisfied, Step [11] is
re-entered. At this point, the first path between LIST[1]
and LIST[2] satisfies [H2]. On executing Step [11]
again, Step [12] is again entered, V placed in TEMP and
marked as OLD in ADJ[LIST[1]]. Since TEMP ≠ LIST[2],
INT = {V} by Step [14] and Step [15] is entered.

Now, if ADJ[V] contains LIST[2], Step [11] is
re-entered. At this point ADJ[LIST[1]] contains no NEW
vertices, and both paths from LIST[1] to LIST[2] satisfy
[H2]. After executing Step [11], Step [20] is executed
and LIST[1], LIST[2] are produced correctly as effective
symmetric and INT = {V} as internal vertex. All elements
of ADJ[LIST[1]] are marked NEW again; Step [21] results
in making INT the empty set, while the new candidate
vertices are LIST[3] and LIST[4]. However, on returning
to Step [4], since LIST[3] = *, the algorithm (correctly)
terminates.

Case IV: As in Case III, LIST[2] is placed in TEMP
and marked as OLD in ADJ[LIST[1]]. Step [11] is
re-entered. However since no other NEW vertex exists in
ADJ[LIST[1]], there is only one path (edge) between
LIST[1] and LIST[2], and Step [20] produces as an
effective symmetric pair LIST[1] and LIST[2], and an empty
set as the internal vertex set, which is correct. [H2]
is thus satisfied.

Case V: Steps [2]-[5], [8], [10]-[11] are executed.
By Step [12], TEMP = V, and V is marked OLD in
ADJ[LIST[1]]; and by Step [14], INT = {V}.

(a) If ADJ[V] does not contain LIST[2] then at least one
path between LIST[1] and LIST[2] in the reduced flowgraph
does not satisfy the condition of [H2]. In the algorithm,
Steps [16]-[18] set members of ADJ[LIST[1]] to NEW, INT to
the empty set, and produce as new candidate vertices
LIST[2] and LIST[3]. The algorithm terminates correctly
without producing an effective symmetric pair.

(b) If on the other hand LIST[2] ∈ ADJ[V], then this
path from LIST[1] to LIST[2] is a valid one. By Step
[15], Step [11] is re-entered; Steps [11] and [12]
produce TEMP = LIST[2], and by Step [13], Step [11] is
entered again. Since all vertices in ADJ[LIST[1]] are
now OLD, Step [20] is next entered. At this point, both
paths from LIST[1] to LIST[2] satisfy [H2], and INT = {V}.
Step [20] thus correctly produces as output LIST[1] and
LIST[2] as effective symmetric and {V} as the internal
vertex set. ADJ[LIST[1]] is made NEW, and the next pair
of vertices become LIST[3] and LIST[4], while INT = ∅.
After executing Step [4], the algorithm terminates.

Case VI: After Step [10] is executed, Step [21] 
produces as new candidates LIST[3] and LIST[4]. The 
algorithm terminates producing nothing which is correct. 

Case VII: By Step [12], TEMP = V, and V is marked
OLD in ADJ[LIST[1]]; by Step [14], INT = {V}.

(a) Now, if LIST[2] ∉ ADJ[V] then Step [16] is entered
after Step [15], and V ∈ ADJ[LIST[1]] made NEW; the
condition of [H2] is of course not satisfied. By Steps
[17]-[19], INT = ∅, and new candidates are LIST[2] and
LIST[3]. Since LIST[3] = *, the algorithm terminates
correctly producing no output.

(b) If LIST[2] ∈ ADJ[V] then a valid path is obtained.
On returning (by Step [15]) to Step [11] no NEW vertex
is found in ADJ[LIST[1]]. Step [20] then produces as
output <LIST[1], LIST[2]> as the effective symmetric
pair with {V} as the internal vertex set, which is
correct. Steps [20]-[21] also make V ∈ ADJ[LIST[1]] NEW,
INT = ∅, and LIST[3], LIST[4] the new candidate vertices.
The algorithm thus terminates correctly.

Case VIII: Step [11] when first entered detects
V_1 as NEW; by Step [12] TEMP = V_1 and V_1 in ADJ[LIST[1]]
is marked OLD. Since TEMP ≠ LIST[2], Step [14] results
in INT = {V_1}.

[a] If LIST[2] ∈ ADJ[V_1], then one valid path has been
found, and after Step [15], Step [11] is re-entered.
By Step [12] TEMP = V_2, and V_2 in ADJ[LIST[1]] is marked
OLD. Since TEMP ≠ LIST[2], Step [14] results in INT =
{V_1,V_2}.

[a.i] If LIST[2] ∈ ADJ[V_2], then the second valid path is
also found; Step [11] is re-entered. Since no NEW
vertices remain, Step [20] is entered and LIST[1], LIST[2]
produced as an effective symmetric pair, with {V_1,V_2} as
the internal vertex set. This is correct.

[a.ii] If LIST[2] ∉ ADJ[V_2], then the second path fails
to satisfy [H2], and Step [16] follows Step [15]. All
elements in ADJ[LIST[1]] are made NEW, INT is made empty,
and the new candidates become LIST[2] and LIST[3] by
Step [18]. The algorithm thus terminates correctly
without producing any output.

[b] If LIST[2] ∉ ADJ[V_1], then the first path is
itself not valid, hence no output should be produced.
After Step [15], Steps [16]-[19] are executed, resulting
in V_1,V_2 ∈ ADJ[LIST[1]] marked NEW, INT made empty,
and LIST[2], LIST[3] the new candidate vertices. After
Step [4], the algorithm terminates correctly.

This completes the proof for |K| = 2. Assume
now as the induction hypothesis that the algorithm is
correct for |K| = n-1, and consider the case of |K| = n.

With the K-th symmetric set as input, i.e. with
LIST = K, Algorithm 5.3 would proceed, starting with
LIST[1], LIST[2] as the initial pair of candidates.
Eventually, the first n-1 elements in K will have been
processed producing (by the induction hypothesis) the
correct (partial) output, and the n-th vertex in LIST
will appear as a candidate. If LIST[n] is the first
candidate, then the second must be LIST[n+1] = *, and
no further output will be produced. Hence the algorithm
is correct.

If LIST[n] is the second candidate, then some
LIST[i], for 1 ≤ i ≤ n-1, is the first candidate.
Moreover LIST[i] cannot be in some effective symmetric pair
that has already been identified. For, such a pair is
(by the induction hypothesis) produced correctly in
Step [20], and the next pair of candidates (by Step [21])
always follows such a pair in the ordered set. Hence, if
<LIST[i], LIST[n]> is a pair of candidates, LIST[i] is
not effective symmetric with any LIST[j] (j ≤ n-1).
If now <LIST[i],LIST[n]> becomes effective symmetric,
then [H3] is guaranteed to be satisfied.

Consider now <LIST[i],LIST[n]> as the candidate
pair. Then the possible cases are precisely Case I-
Case VIII discussed for |K| = 2. By following a similar
argument, it is easily seen that <K[i],K[n]> is either
identified correctly as an effective symmetric pair, or
both are rejected. In either case effective symmetric
pairs in LIST are produced satisfying [H1]-[H3]. This
completes the proof of correctness of Algorithm 5.3. □

As an example, Algorithm 5.3 may be applied to the
weighted reduced flowgraph of Fig. 5.14. The adjacency
lists and symmetric sets for this example are shown in
Fig. 5.18. The reader may verify that the outputs
produced are as follows:
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ADD, (iy = 2} 
ADJ [2'] = {4} 

ADJ [4] = {5,6}
ADJ [5] = {8}

ADJ [6] = {8}

ADJ [8] = {9}

ADJ [9] = {10,11}
ADJ [10] = {12}
ADJ [11] = {12}
AD ae ee ee les ee oa 
ADD [13 "= 115) 
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5.6 The Parallelism-Detection Algorithm 


Let (S_i, S_j) be an effective symmetric pair and V_ij
its internal vertex set as determined by Algorithm 5.3.
Consider a pair of MO's μ_p in S_i and μ_q in S_j. To
determine whether μ_p and μ_q are globally parallel or not
requires determination of:

(a) Whether μ_p and μ_q are global candidates (see Section
5.3, Defn. 5.3), that is, by Theorem 5.3, whether for all

be ald gl Vags 
(b) Whether the condition (uy, 6 He) V (uy ¥ Ug) Pees ts 


Wy Supe and 


Picde(See-DeEL. 5.4 and Condition 5.7). 

An additional problem is that within the SLM Sar 
there may exist some MO Hess Up such that Ue must be 
executed prior to Ur and yet Ws is not globally parallel 
to any MO” in S.. Hence a further condition that must be 
satisfied is: 

(c) Whether the movement of HW, out OL 2 ae S5 is 
constrained by one or more MO's within 55 Ltseri. 

He there 2s) such a2 Constraint, tuere ws Clearly 
MoO point in testing for conditions (a) or Abj)- “Similarly, 
assuming the absence of this constraint, there is no 
point in testing for condition (b) unless condition (a) 
Holds, We can thus establish a priority ordering for the 
testing of these three distinct conditions. 


A further aspect of the problem is that the total
number of microinstructions obtained by global analysis
should be no greater than the number obtained by local
analysis only. Suppose the sequence of microinstructions
corresponding to S_p (as determined by Algorithm 4.1) is
given by

    I_sp = <I_p1, I_p2, ..., I_pk>

such that I_p1 < I_p2 < ... < I_pk. Then μ_j from S_q can at
best be placed in one of these microinstructions, or at
worst, in a separate microinstruction, i.e. other than
those specified as I_pi. The latter, however, may lead to
an increase in the total number of microinstructions. A
safer choice in this case is to place μ_j in one of the
microinstructions obtained from local optimization of S_q.
This at least ensures that the total number of
microinstructions will not be larger than that obtained by
purely local analysis.

In view of these considerations, the basic approach
used by the parallelism-detection algorithm is as follows:
[1] For the sequence of MO's in S_p, use Algorithm 4.1
to obtain a sequence of microinstructions I_sp.
[2] For each MO μ_j in S_q invoke Algorithm 4.1 and
determine whether μ_j can precede all the microinstructions
of I_sq obtained thus far. If this is not so, then
continue with Algorithm 4.1 and place μ_j in the earliest
possible microinstruction of I_sq.
[3] Otherwise, determine whether μ_j β μ_k for all μ_k in
V_pq. If not, then place μ_j in the earliest microinstruction
of I_sq.
[4] Otherwise invoke Algorithm 4.1 in order to place
μ_j in some microinstruction in I_sp. If μ_j cannot be
placed in an existing microinstruction, then place it
in the earliest microinstruction of I_sq.
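
The four steps above can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not Algorithm 5.4 itself: the predicates `depends_on`, `compatible` and `independent_of_internal` are hypothetical stand-ins (supplied by the caller) for the ordering tests of Algorithm 4.1, the resource and timing tests, and the relation β against V_pq respectively, and the micro-operation names in the usage example are invented.

```python
from typing import Callable, List, Optional, Set, Tuple

MicroInstr = Set[str]


def earliest_slot(mo: str,
                  seq: List[MicroInstr],
                  depends_on: Callable[[str, str], bool],
                  compatible: Callable[[str, MicroInstr], bool]) -> Optional[int]:
    """Index of the earliest microinstruction of seq that may receive mo."""
    start = 0
    for k, mi in enumerate(seq):
        # mo must come strictly after any microinstruction holding a predecessor.
        if any(depends_on(mo, other) for other in mi):
            start = k + 1
    for k in range(start, len(seq)):
        if compatible(mo, seq[k]):
            return k
    return None


def detect(s_q: List[str],
           i_sp: List[MicroInstr],
           depends_on: Callable[[str, str], bool],
           compatible: Callable[[str, MicroInstr], bool],
           independent_of_internal: Callable[[str], bool],
           ) -> Tuple[List[MicroInstr], List[MicroInstr]]:
    """Return (I_sp, I_sq) after trying to move MO's of S_q into I_sp."""
    i_sp = [set(mi) for mi in i_sp]          # step [1] result, copied
    i_sq: List[MicroInstr] = []
    for mo in s_q:
        # Step [2]: can mo precede everything kept in I_sq so far?
        free = not any(depends_on(mo, o) for mi in i_sq for o in mi)
        # Step [3]: is mo independent of every MO in V_pq?
        if free and independent_of_internal(mo):
            # Step [4]: only an existing microinstruction of I_sp is used,
            # so the total count never exceeds that of local analysis.
            k = earliest_slot(mo, i_sp, depends_on, compatible)
            if k is not None:
                i_sp[k].add(mo)
                continue
        # Otherwise: purely local placement within I_sq.
        k = earliest_slot(mo, i_sq, depends_on, compatible)
        if k is None:
            i_sq.append({mo})
        else:
            i_sq[k].add(mo)
    return i_sp, i_sq


# Hypothetical usage: u10 must follow u9, u12 must follow u11,
# and only u9 and u10 are independent of the internal vertex set.
DEPS = {("u10", "u9"), ("u12", "u11")}


def demo_depends_on(a: str, b: str) -> bool:
    return (a, b) in DEPS


def demo_compatible(mo: str, mi: MicroInstr) -> bool:
    return len(mi) < 3                       # assumed limit of 3 MO's


def demo_independent(mo: str) -> bool:
    return mo in {"u9", "u10"}


demo_sp, demo_sq = detect(["u9", "u10", "u11", "u12"],
                          [{"u1", "u2"}, {"u3"}, {"u4"}],
                          demo_depends_on, demo_compatible, demo_independent)
```

In this run the first two micro-operations migrate into the existing microinstructions of I_sp, while the last two stay in S_q, so no new microinstruction is created on the S_p side.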
The complete algorithm is presented below. 


Algorithm 5.4


Identification of parallel micro-operations in a
symmetric pair <S_p, S_q> with internal vertex set V_pq.


Comment
Let S_q = <μ_1, μ_2, ..., μ_t>, where t is the number of MO's
in S_q. Denote the sequences of microinstructions
corresponding to S_p and S_q by I_sp and I_sq respectively. As in
Algorithm 4.1, i is a pointer to the microinstruction in
I_sq "currently" being examined; i' is a pointer to the
latest microinstruction in I_sq at any given time. For
I_sp, a corresponding pair of pointers serves similar
functions, except that the second of these will
remain invariant since I_sp is already determined prior to
examining S_q. j points to an element of the input SLM S_q.
As in Algorithm 4.1, the expression "branch (x)" denotes
a predicate whose value is TRUE if x is a BMO, FALSE
otherwise.


[1] Apply Algorithm 4.1 to S_p. Let the resulting
sequence of microinstructions be I_sp.
[2] af Pela Ky 
[3] Loe LY est ee Os I, = 1; 
[4] J) elas lke 705 
i 5 ° 
PED ieee toe seit ys) ane TOr 
[5] it branch (us) then 
begin 


ifull; uy * vet; 
then I.e¢ Eevee 
else 


eee oa: 
begin ee eG Dy eee eee end 


goto [4] 
end 
[6] If (j = 1) V (1, = ¢) then goto [G1] 
[7] laecge eT, 3v(ullpuj) AvG Mus) 
then 
begin 


jeer ecw hee eee 
La {us ti 


goto [4] 


end 


' ' = 
[8] Tf (a weT 3 u yus) AG irae Patines hae) 


then 


begin 
pees tush; 
goto [4] 


end
[9] i Ca weTy au du, ASK nSK, # 9) A
1 ' =) 
We | des us wu eT; {u}) 
then 

begin 
Tet, u tuys 
goto [4] 

end 


[10] While [(i Su. ASK n SK, = o)V (yp r* us) ¥ UE I, ] N\ [aes 0) 


do 
te (16 us ASK n SK, =9) vee ee Ener aa, 
phar al 
end
fee Le i= 0 then
begin 
pl La te ie then 


begin 


else goto [G1] 


end
[12\- While g Weslo ~(u] [u5) do
begin 
lee i eo 


aE eg eee nl 


begin in < tus 
ah Uo ea 
goto [4] 
end 


end 


(3H) ee ay u tus}; 
ay Ree 
goto [4] 


G ee EV 3 V(L. then 
[Gl] ieee ca B u) e 


PG 
begin 

Ee ey (I, = >) then 
begin 

goto [4] 


end 


else if a # 0 then begin Te eG se a Bl 
goto [4] 


end
[G1a] else begin
ee 


White ke> 0 do 


begin 
regi 
Kiek = 1 
end 


end 
Ss * 
bC2 i» Le at ae vn |], uj) Aw Cr Ax us) then 
begin 


ie leche 


begin 
a tus ti 
goto [4] 
end 


else goto [Gla] 


end
[G3] rea 4 sf. ’ : -
Ea Kan gi ap wy us) AG i He 4 Pits {u}) 


J 
then 
begin 
L <a Ta et ee 
i, i, 5 
goto [4] 
end 
[G4] alee Saal Ec asses eee Ola ASK 7 SK. NOT , 
cee ea nSK,7o)] Atu'| I, wy 
Weel Saw iat) 
a2 
then 
begin 
Toe lee dale) 
To, ‘> J 
goto [4] 
end 


[G5] While [(u ae ASKn SK.=$) V (yu A* Hs) “Wer. JATi, > 0)j 


2 
do 

begin 
Tf Olin ok Mokhn— 0) Sve Moet themed. Soule. 
Deu he : o U ss 5 
1, — 15 - sk 

end 

[G6] ie i, See hen 


begin if a'#0 thn begin 


I_, +I 


e U tu, i 


By! 
i, < ar goto [4] 


end 


else goto [Gla] 


end
[G7] WHILe@O eT) Shean) do
begin 


Nee 4: 


5 tones 


2 
ibe 15 > ae then goto [Gla] 
end 
[G8] es Tits U tust; 
ae 


: v 
+9 a! 


goto [4] 0 


Verification of this algorithm proceeds along 
steps similar to those for the verification of Algorithm 
4.1 and is therefore omitted here. A few comments are 
however necessary. 

Steps [2]-[13] are almost identical to Algorithm
4.1; these steps construct the microinstruction set I_sq.
However, if μ_j, the MO currently being examined, is such
that it (a) is the first MO in S_q; or (b) can precede
all MO's in the set I_sq of microinstructions obtained so
far, the algorithm then checks whether μ_j β μ_k for μ_k in
the internal vertex set V_pq (Step [G1]). If this
relation holds, the algorithm then proceeds to check whether
μ_j can be placed in one of the microinstructions of I_sp
(Steps [G2]-[G8]). If μ_j is such that it cannot be
placed in any of the microinstructions of I_sp, then
sub-step [G1a] ensures that μ_j is put in an existing
microinstruction of I_sq if possible or otherwise, in a newly
created microinstruction.


5.7 An Example


The reader may obtain a more intuitive idea of
the way Algorithm 5.4 works by considering an example.
Fig. 5.19 shows the canonical microprogram of Fig. 5.[…],
represented rather more conventionally. The time
validities are indicated in parentheses, while the operational
units are implicit.

For this example, clearly:

    S_p  = <μ_1, μ_2, μ_3, μ_4>
    V_pq = <μ_5, μ_6, μ_7, μ_8>
    S_q  = <μ_9, μ_10, μ_11, μ_12, μ_13, μ_14>.

Since V_pq contains a single vertex which is also an SLM,
let us assume (without loss of generality) that Algorithm
4.1 has already been applied to it. Two microinstructions
are obtained:

    {μ_5, μ_6, μ_7}
    {μ_8}.

Thus, on executing Step [1] of Algorithm 5.4, the sequence
of microinstructions will be as shown by Fig. 5.21(a).
The input string at this point is S_q. The subsequent
pattern of construction of the microinstruction sequence
is shown as Figs. 5.21(b)-(f), while the final output
is shown as Fig. 5.20.


μ_1 :  AIL ← R1;
μ_2 :  AO ← shl AIL;
μ_3 :  R1 ← R0;
μ_4 :  if R1 … 0 then goto μ_9;
μ_5 :  AIL ← R1;
μ_6 :  AIR ← R3;
μ_7 :  R2 ← …;
μ_8 :  R2 ← AO;
μ_9 :  MAR ← …;
μ_10:  MBR ← MEM[MAR];
μ_11:  MBR ← R1;
μ_12:  AIL ← R1;
μ_13:  AIR ← R2;
μ_14:  AO ← AIL + AIR;

(The time validity given in parentheses after each
micro-operation, and the parts shown as …, are illegible
in the source.)


Fig. 5.19


An Example of a Canonical Microprogram


I_1 = {μ_1, μ_2, μ_9}

I_2 = {μ_3, μ_10}

I_3 = {μ_4}

I_4 = {μ_5, μ_6, μ_7}

I_5 = {μ_8}

I_6 = {μ_11, μ_13}

I_7 = {μ_12, μ_14}

Fig. 5.20


Output from Algorithm 5.4 on the Example of Fig. 5.19


[Fig. 5.21 (a)-(f): Construction of the Microinstruction
Sequence for the Example of Fig. 5.19. Panel (a) shows the
sequence {μ_1, μ_2}, {μ_3}, {μ_4}, {μ_5, μ_6, μ_7}, {μ_8}; the
remaining panels add the micro-operations of S_q one at a
time until the sequence of Fig. 5.20 is reached.]


5.8 Conclusions


The main result of this chapter is the development 
of a partial theory of global micro-parallelism and its 
application to the design of algorithms for detecting 
parallelism in loop-free, canonical microprograms. 

The output produced by this system of algorithms
may not always be optimal, since several heuristics were
used to make the problem analysis more manageable.
However, in the worst case, the output produced will
certainly be at least as good as the output produced
by local analysis only.

This last assertion may appear somewhat weak
considering the computational work involved in global
analysis. But we have already seen an example where better
(in fact optimal) output was produced. Moreover, given
that parallelism-detection is to be done statically
(at compile-time) and that many microprograms will be
executed several (probably hundreds of) thousands of
times over a machine's operational lifetime, the
overhead incurred in global analysis will probably be
justified where instruction execution efficiency is the main
architectural performance objective.


CHAPTER VI


LANGUAGE CONSTRUCTS FOR HORIZONTAL MICROPROGRAMMING 


6.1 Introduction 


In the last two chapters, I have discussed the 
design of some algorithms for the identification of 
parallelism in canonical microprograms. In concluding 
Chapter V, it was also pointed out that the particular 
global approach developed here, may not always lead to 
an optimal output. Moreover, given the computational 
overheads involved in global analysis, not all micro- 
programs may be suited for such extensive analysis and 
optimization. 

These are practical constraints on the use of
mechanical optimization which implementers (of a
microprogramming support system) must evaluate, after taking
into consideration the host machine architecture and the
nature of the machine instructions to be implemented.

In this chapter, I will consider a further aspect
of micro-parallelism; essentially, this constitutes
another addition to the catalogue of techniques for
solving the problem of constructing horizontal
microprograms. One might call this the linguistic approach
to the problem.

To motivate the discussion, recall that the
execution of a microinstruction may, in general,
involve both parallelism and sequential activation
within a microcycle, particularly if the machine
utilizes a polyphase timing scheme (see Section 2.1
and Chapter III). Moreover, in the case of multicycle
microinstructions, these same effects may "spread"
over several microcycles.

Designers of high-level microprogramming languages
have recognized and responded to this fact in (predictably)
two different ways. Thus, to avoid explicit
denotation of such relationships between micro-operations,
Ramamoorthy and Tsuchiya [54] proposed a language by
which microcode is specified in canonical form, while
the task of extracting horizontal microinstructions was
delegated to the translator. Partly influenced by
Eckhouse's work on the vertical microprogramming language
MPL [26], I had expressed rather similar ideas in an
unpublished thesis [22]. This particular approach, in
fact, provided the impetus for the search for
parallelism-detection algorithms, many of which have been
described in the earlier chapters.

At the same time, proposals were also made for
representing horizontal microinstructions explicitly in
the source text [56]. However, Chu's CDL [15] appears
to be the only instance of a microprogramming language
containing facilities for specifying the timing
characteristics of micro-operations. In CDL, the
programmer associates with one or more micro-operations
a "label" designating the part of the microcycle in
which the operations are to be activated.
In other words, CDL allows the expression of polyphase
(as well as monophase) horizontal microprograms.

In this chapter, I shall describe a set of
constructs which, like CDL, permit horizontal
microprograms to be represented explicitly. As stated in
Chapter I however, the proposed constructs are
motivated by the following considerations:

(1) It seems desirable that a microprogramming
language should give the programmer a choice as to
whether the horizontal microprogram be specified
explicitly or otherwise. Such a choice seems rather
important when one realizes that automatic generators
of horizontal microcode may not always yield optimal
code. Thus, the microprogrammer may wish to optimize
critical segments of microcode manually at the source
level, in which case the horizontal microprograms must
be specified explicitly.

(2) Given the necessity of devising constructs for
horizontal microprogramming, a further crucial
characteristic of these constructs must also be considered:
for the purpose of microprogram validation and
understanding it is highly desirable that the
microprograms be structured. However, because of the
complications induced by polyphase timing schemes, structured
horizontal microprogramming cannot merely utilize the
well known concepts of structured sequential
programming [18]. Of far greater relevance are the notions of
concurrent programming developed by operating systems
theorists [13, 35].

Thus, a major aim in the design of the proposed
constructs is to facilitate the construction of
structured horizontal microprograms, "structured" in the
sense that for each of the proposed constructs, specific
and useful inductive expressions [48] can be defined.
As in the case of software design, the ability to make
such assertions about the state of the machine should
greatly facilitate the verification and understanding
of microprograms. This aspect of microprogramming
language design has been almost entirely neglected
heretofore. (1)

The following discussion focusses entirely on
constructs for expressing the "horizontal" characteristic
of microprograms. I shall assume (and this is not a
particularly restrictive assumption) that other
constructs exist but that they represent individual,
indivisible operations (including branches) whose syntax
conforms to, say, that of CDL or Ramamoorthy and Tsuchiya's
SIMPL language [54].

(1) For a fairly comprehensive review of the status of
microprogramming language design, the reader is
referred to the very recent monograph "Foundations
of Microprogramming" by A.K. Agrawala and T.G.
Rauscher (Academic Press, 1976).


6.2 A Special Constraint on Construct Formation 


Whatever be the form of the constructs that we
may choose to propose, they must satisfy the following
constraint:

Given a statement (i.e., an instance of a proposed
construct), and assuming the existence of an unambiguous
mapping of that statement into object microcode, the
parallel/serial relationships between components of this
microcode must be unambiguously evident in the statement
itself.

Such a constraint on the form that constructs may
take is imposed from a concern for enhancing both the
comprehensiveness and the verifiability of horizontal
microprograms.

In writing a horizontal microprogram, the programmer
may conveniently mimic the logic described in Chapters
IV and V in obtaining a final product. That is, the
programmer may begin with a sequential program, then
convert this into an equivalent horizontal program by
examining all data dependencies and hardware resource
conflicts between the micro-operations. Our present
interest, however, lies in the "final" product. For,
given a sequence of statements

    σ_1; σ_2; ...; σ_n

we must ask (a) whether the individual σ_i's are valid
statements; and (b) whether the statement sequence is
valid. These questions can be answered if we know the
following: given a valid sequence of microinstructions,
what conditions must hold between the micro-operations
(i) within each microinstruction and (ii) belonging to
different microinstructions?


Given a microinstruction

    I_k = {μ_1, μ_2, ..., μ_n}

the relation μ_i I_k μ_j is defined for all μ_i, μ_j in I_k.
Note that the I_k relation between some pair of
micro-operations μ_i, μ_j holds only in respect to a specific
given microinstruction I_k and merely indicates the fact
that μ_i, μ_j have been placed in I_k. If a different
microinstruction I_l contains μ_i but not μ_j, then μ_i I_l μ_j
does not hold.

A microinstruction is said to be valid if its
execution satisfies the following two conditions:

(D1) The state of the machine can be determined exactly
after the execution of a known set μ of micro-operations,
provided the machine state is known prior to executing μ;
and

(D2) No two micro-operations can use an operational
unit at the same time.

From earlier discussions on both potential and
actual parallelism, it should be obvious that given a
valid microinstruction I_k, μ_i I_k μ_j implies

    (V_i ∩ V_j = φ) ∨ [(V_i ∩ V_j ≠ φ) ∧ (μ_i β μ_j) ∧ (U_i ∩ U_j = φ)].
                                                            (6.1)

This follows from the fact that if μ_i, μ_j are in the same
microinstruction they must be potentially parallel.
(6.1) simply specifies the condition for pairwise
potential parallelism.


6.3 Representation of Horizontal Microprograms 


Condition (6.1) specifies constraints on the
components of a horizontal microinstruction. I shall
discuss now some language constructs that reflect these
constraints.

Consider for the present only those
micro-operations which are executable within one microcycle.
Following [13], I propose the concurrent microstatement

    "σ" cobegin μ_1; μ_2; ...; μ_n coend              (6.2)

where μ_1, μ_2, ..., μ_n are micro-operations, to specify that
μ_1, ..., μ_n are to be executed "concurrently". That is, there is
at least one phase of the microcycle when all the μ_i's
will be in execution. The execution of "σ" terminates
only when the execution of all the micro-operations in
"σ" has terminated.

Given a concurrent microstatement "σ", the
condition V_i ∩ V_j ≠ φ must hold for all pairs μ_i, μ_j in σ.
Hence a valid concurrent microstatement is one for which
the second term of (6.1) holds for all pairs μ_i, μ_j in
the statement.

For example, suppose for some particular host
machine, the time validities of some of the
micro-operations are as shown in Fig. 6.1. Then the statement

    cobegin A ← B; F ← H coend                        (6.3)

is a valid one (assuming of course that the transfer
paths implicit in these two operations are distinct).

[Fig. 6.1: Some Micro-operations and Their Time Validities.
Timing diagram; the legible entries include G ← B + E
(Adder) and a shift operation (Shifter).]

On the other hand, the statement

    cobegin A ← B; B ← I; I ← shl F coend             (6.4)

is considered invalid since firstly, "A ← B" and "B ← I"
are not data-independent though their time validities
are the same; and secondly, even though there are no
conflicts between "B ← I" and "I ← shl F", the fact that
their time validities are disjoint precludes their
simultaneous presence in a concurrent microstatement.
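
The two validity tests discussed so far can be sketched in Python. The sketch assumes, for illustration only, that a time validity is a set of phase numbers, that β can be approximated by a simple read/write disjointness test, and that the phases and unit names given to the four micro-operations below are hypothetical (Fig. 6.1 is not reproduced here). `potentially_parallel` follows condition (6.1); `valid_cobegin` requires the second term of (6.1) for every pair.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class MicroOp:
    text: str
    sinks: FrozenSet[str]     # registers written
    sources: FrozenSet[str]   # registers read
    phases: FrozenSet[int]    # time validity V (assumed representation)
    units: FrozenSet[str]     # operational units U


def op(text, sinks, sources, phases, units) -> MicroOp:
    return MicroOp(text, frozenset(sinks), frozenset(sources),
                   frozenset(phases), frozenset(units))


def data_independent(a: MicroOp, b: MicroOp) -> bool:
    """Assumed stand-in for the relation beta: no shared sink, and
    neither operation writes what the other reads."""
    return (not (a.sinks & (b.sinks | b.sources))
            and not (b.sinks & a.sources))


def overlap_term(a: MicroOp, b: MicroOp) -> bool:
    """Second term of (6.1): overlapping validity, independent, no unit clash."""
    return (bool(a.phases & b.phases)
            and data_independent(a, b)
            and not (a.units & b.units))


def potentially_parallel(a: MicroOp, b: MicroOp) -> bool:
    """Condition (6.1): disjoint time validities, or the second term."""
    return not (a.phases & b.phases) or overlap_term(a, b)


def valid_cobegin(ops: Iterable[MicroOp]) -> bool:
    """A concurrent microstatement needs the second term for all pairs."""
    return all(overlap_term(a, b) for a, b in combinations(list(ops), 2))


# Hypothetical phases and units for the operations of (6.3) and (6.4).
A_B = op("A <- B", {"A"}, {"B"}, {1}, {"path1"})
F_H = op("F <- H", {"F"}, {"H"}, {1}, {"path2"})
B_I = op("B <- I", {"B"}, {"I"}, {1}, {"path3"})
SHL = op("I <- shl F", {"I"}, {"F"}, {2}, {"shifter"})
```

Under these assumptions `valid_cobegin([A_B, F_H])` is true and `valid_cobegin([A_B, B_I, SHL])` is false, while `potentially_parallel(B_I, SHL)` is still true: the pair may share a microinstruction but not a concurrent microstatement.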

Given a concurrent microstatement, we can make
rather specific assertions about its effect on the
machine state. Prior to illustrating this, let me
introduce first, the term microprocess to designate any
sequence of events at the register-transfer level, and
secondly, the notation (after Hoare [35])

    {P} Q {R}                                         (6.5)

which indicates the partial correctness of the
microprocess Q with respect to the assertions P and R; i.e.,
if an assertion P is true of the machine state before the
start of the microprocess Q, and Q terminates, then the
assertion R is true when Q terminates. P and R are
often called the precondition and postcondition
respectively of Q, and the entire expression (6.5) an
inductive expression.

In the case of the (valid) concurrent
microstatement, given that

    {P_1} μ_1 {R_1}, {P_2} μ_2 {R_2}, ..., {P_n} μ_n {R_n}    (6.6)

then

    {P_1 ∧ P_2 ∧ ... ∧ P_n} cobegin μ_1; μ_2; ...; μ_n coend
                                      {R_1 ∧ R_2 ∧ ... ∧ R_n}  (6.7)

Furthermore, since μ_1, ..., μ_n are all executed within a
microcycle, the postcondition R_1 ∧ R_2 ∧ ... ∧ R_n will be
true before the end of that microcycle.
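
As a small illustration of (6.7), the sketch below models the machine state as a dictionary of register values and a micro-operation as a (sink, function-of-state) pair. Both are modelling assumptions made for this example, as is the rule that every source is read from the state holding before the statement begins; for data-independent operations that rule does not affect the result.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
MicroOp = Tuple[str, Callable[[State], int]]   # (sink register, new value)


def run_cobegin(state: State, ops: List[MicroOp]) -> State:
    """Execute all operations of a concurrent microstatement 'at once'."""
    new_state = dict(state)
    for sink, value in ops:
        # Sources are read from the state before the statement started.
        new_state[sink] = value(state)
    return new_state


a_gets_b: MicroOp = ("A", lambda s: s["B"])    # A <- B
f_gets_h: MicroOp = ("F", lambda s: s["H"])    # F <- H

# Preconditions: P1 is "B = 7", P2 is "H = 3".
before: State = {"A": 0, "B": 7, "F": 0, "H": 3}
after = run_cobegin(before, [a_gets_b, f_gets_h])

# Postconditions: R1 is "A = 7", R2 is "F = 3"; (6.7) asserts R1 and R2.
r1_holds = after["A"] == 7
r2_holds = after["F"] == 3
```

Each operation on its own establishes its own postcondition, and running both in one `cobegin` establishes the conjunction, which is all that (6.7) claims.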


Referring to Fig. 6.1 again, consider the two
micro-operations "A ← B" and "G ← B + E". Clearly the time
validity of "A ← B" precedes that of "G ← B + E". In a
particular situation, a programmer may wish to execute
"A ← B" before "G ← B + E", in which case the two operations
can be placed in the same microinstruction. However
"A ← B" would clearly be executed before "G ← B + E" though
both would execute in the same microcycle.

To distinguish between concurrently executable
micro-operations, and sequential execution of
micro-operations within a microcycle, the latter can be
represented by means of the short sequential (SS)
microstatement:

    shseq σ_1; σ_2 end                                (6.8)

where σ_1 is a single micro-operation or a concurrent
microstatement; and σ_2 is a single micro-operation, a
concurrent microstatement, or another SS microstatement.

This construct states explicitly that the time
validities of all micro-operations in σ_1 precede the
time validities of all the micro-operations in σ_2
(denoted V(σ_1) < V(σ_2)). Furthermore, all micro-operations
in σ_1 ∪ σ_2 are executed in a microcycle. They must
therefore be placed in one microinstruction.

Consider for example, the following SS
microstatement:


    shseq
      "σ_1" cobegin A ← B; D ← E coend;
      "σ_2" shseq
              "σ_3" cobegin
                      F ← shl D;
                      E ← B + D;                      (6.9)
                      CAR ← …
                    coend;
              "σ_4" MIR ← CM
            end
    end


This is a valid SS microstatement provided that (a)
V(σ_1) < V(σ_2); (b) V(σ_3) < V(σ_4) (since σ_2 is itself an
SS microstatement); and (c) σ_1 and σ_3 are valid
concurrent microstatements.

Assuming that these conditions are satisfied, all 6
micro-operations can be placed in the same
microinstruction since between any pair of them, condition (6.1)
holds.

The inductive expression for the SS microstatement
(6.8) is as follows: since σ_1 and σ_2 are executed in
sequence, but are both completed in a microcycle, if
{P} σ_1 {Q} and {Q} σ_2 {R} then

    {P} shseq σ_1; σ_2 end {R}                        (6.10)

and R is true at the end of the microcycle.
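
Conditions (a) and (b) for a nested SS microstatement can be checked mechanically. The sketch below assumes, purely for illustration, that a time validity is a non-empty set of phase numbers and that "V(σ_1) < V(σ_2)" means every phase used in σ_1 is earlier than every phase used in σ_2; a single micro-operation is modelled as a one-element `Cobegin`. Condition (c), the validity of each concurrent microstatement, is not checked here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class Cobegin:
    # Time validity of each micro-operation in the statement.
    phases: Tuple[FrozenSet[int], ...]


@dataclass(frozen=True)
class ShSeq:
    first: Cobegin
    second: Union[Cobegin, "ShSeq"]


def all_phases(stmt: Union[Cobegin, ShSeq]) -> FrozenSet[int]:
    """Union of the time validities of every micro-operation in stmt."""
    if isinstance(stmt, Cobegin):
        return frozenset().union(*stmt.phases)
    return all_phases(stmt.first) | all_phases(stmt.second)


def valid_shseq(stmt: ShSeq) -> bool:
    """Check V(first) < V(second), recursively for nested SS statements."""
    ordered = max(all_phases(stmt.first)) < min(all_phases(stmt.second))
    if isinstance(stmt.second, ShSeq):
        return ordered and valid_shseq(stmt.second)
    return ordered


# Hypothetical phase assignments with the shape of (6.9).
sigma_1 = Cobegin((frozenset({1}), frozenset({1})))
sigma_3 = Cobegin((frozenset({2}), frozenset({2}), frozenset({2})))
sigma_4 = Cobegin((frozenset({3}),))
nested = ShSeq(sigma_1, ShSeq(sigma_3, sigma_4))
reversed_order = ShSeq(sigma_3, sigma_1)
```

With these phase numbers `valid_shseq(nested)` holds, while `valid_shseq(reversed_order)` does not, because the first component would then have the later time validity.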

Comparing the two expressions (6.7) and (6.10), it
should be evident why a clear distinction between these
two situations has been made. For otherwise, if we were
to extend the scope of a "valid" concurrent microstatement
so as to allow the inclusion of any set of micro-operations
such that (6.1) was satisfied, then the inductive
expression (6.7) would certainly not hold at all times. By
providing distinct constructs for these two distinct
microprogramming situations, distinct and sharply defined
assertions can be made about the machine state. Hence
both verifiability and comprehensiveness are greatly
enhanced.

It should be pointed out that though I talk about
the machine "state", in expressing the pre- and
postconditions it is sufficient to specify the states of
the sources and sinks of these micro-operations only;
the other registers, memories, etc. will of course
remain unchanged in content at the time that this
particular microstatement is in execution.

Thus far, I have described constructs
corresponding to microprocesses that are executed in 1 microcycle.
Fig. 6.2 shows a microprocess spanning 3 microcycles:
here, a main memory read ("MBR ← MEM[MAR]") operation
requires 3 microcycles. However, certain other
micro-operations are to be executed while the memory read is
in progress.

[Fig. 6.2: Relationship Between the Microcycle and the Main
Memory Cycle. Timing diagram: the memory cycle of
MBR ← MEM[MAR] spans three microcycles, during which the
micro-operations D ← A + B, E ← D, F ← E and G ← H are
executed.]


Such concurrency involving multicycle synchronous
micro-operations cannot be expressed using the concurrent
or SS microstatements since the postconditions of these
statements are true by the end of the microcycle and no
later. In fact, note that the postcondition for a
concurrent microstatement may not be true at the end of the
microcycle though it will have become true earlier in
the cycle. This happens for example when σ_1 in (6.8) is a
concurrent microstatement. Since V(σ_1) < V(σ_2), given
{P} σ_1 {Q}, the assertion Q might not be true at the end
of the microcycle.

To express multicycle concurrency, I propose the
extended concurrent (EC) microstatement:

    dur σ_1 do σ_2 end                                (6.11)

where σ_1 is a micro-operation, and σ_2 is either another EC
microstatement, or a sequence

    σ_21; σ_22; ...; σ_2k                             (6.12)

in which σ_2i is either a micro-operation, an SS
microstatement, a concurrent microstatement, or the empty
microstatement (see below), such that for 1 ≤ i ≤ k-1 the
execution of σ_2i is completed in a microcycle immediately
preceding the microcycle in which σ_2,i+1 is initiated.

Given an EC microstatement, σ_1 and σ_2 will be
executed concurrently; execution of the EC microstatement
terminates only when both σ_1 and σ_2 have terminated.

In some multicycle programming situations, additional
dummy cycles may be necessary to synchronize certain
events. One way of introducing a dummy cycle is through
the use of a "NO-OP" microinstruction. The empty
microstatement referred to above performs this function:
it indicates the execution of an empty set of
micro-operations, this "execution" requiring 1 microcycle.
It may be denoted simply by the symbol […].


The example of Fig. 6.2 can be expressed as: 


dur MBR ← MEM[MAR]
don FAG Bi; 
Dre TAL at Bis 
shseq 
cobegin                                               (6.13)


GP SH 


end 


The EC microstatement must of course, also satisfy
the rule of disjointness; that is, referring to (6.11),
if σ_1 designates a particular micro-operation, say μ_i,
then for each micro-operation μ_k specified in σ_2, the
condition

    (V_i ∩ V_k ≠ φ) ∧ (μ_i β μ_k) ∧ (U_i ∩ U_k = φ)

must hold. Because of the disjointness rule, given
{P_1} σ_1 {Q_1} and {P_2} σ_2 {Q_2}, then

    {P_1 ∧ P_2} dur σ_1 do σ_2 end {Q_1 ∧ Q_2}.       (6.14)

However, since the only timing assertion made about
σ_1 is that it will execute concurrent to every
micro-operation in σ_2, we cannot make any general statement as
to precisely when (relative to the beginning of execution
of the EC microstatement) Q_1 ∧ Q_2 will hold. For example,
σ_1 may continue for a few more cycles after σ_2 has
terminated or vice-versa. The most precise general statement
that can be made is that the earliest time at which Q_1 ∧ Q_2
may possibly be true is when σ_2 has terminated.

When the host machine structure allows only
synchronous operations and the duration of σ_1's execution
is known, then of course, rather specific timing
assertions can be made. Referring to (6.13) for example, if
(or since) it is known that a main memory cycle requires
3 microcycles (Fig. 6.2), and since σ_2 requires 3
microcycles to complete, Q_1 ∧ Q_2 will be true 3 microcycles
after initiating execution of the EC microstatement.

The EC microstatement can also be used to describe
parallelism involving asynchronous micro-operations. For
example, consider the statement

    "σ" dur MBR ← MEM[MAR] do R1 ← R2 end             (6.15)

In this case, if "MBR ← MEM[MAR]" happens to be an
asynchronous operation, and takes longer than "R1 ← R2",
then σ's execution terminates only when the asynchronous
operation terminates.


The programmer can construct entire horizontal 
microprograms using the above three constructs together 


with the long sequential (LS) microstatement: 


lseq Seta Wa Bac 3 it oO, end (Geb) 


where each oO is a micro-operation or one of the micro- 
statements already defined. This statement carries with 
facie Meaning that Lor des. iw< m—1: 


(a) if neither Oo; nor o; contain asynchronous opera- 


+1 


tions, the components of O complete execution in a micro- 
cycle preceding the earliest microcycle in which any com- 


ponent of O; can begin execution; 


+1 


(b) ie Oo, contains an asynchronous component, then 


O begins execution only when oO. terminates. 


atl 
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The LS microstatement is thus the multicycle analogue 


of the SS microstatement just as the EC microstatement is 
the multicycle analogue of the concurrent microstatement. 


Thus, given 
{p, } O71 Deh, Deo eo Nea Jnn on monary) oe De eed 


then 


LP a US eCieC eo eet a end LE ae : (6217) 


As in the case of the EC microstatement, the precise time (relative
to the beginning of the LS statement's execution) at which P_{m+1} is
true cannot be stated in general since one or more of the o_i's may
be EC microstatements. However, if a particular LS microstatement
contains no multicycle components, or there are no asynchronous
operations and the timing characteristics of the micro-operations are
known, then time-specific assertions can be made.


One must note the distinction between the SS and LS microstatements.
For example, given

    shseq o_1 ; o_2 end
    lseq o_1 ; o_2 end

and assuming that the inductive expressions

    {P} shseq o_1 ; o_2 end {R}        (6.18)

    {P} lseq o_1 ; o_2 end {R}         (6.19)

are both true, then the distinction lies in that in (6.18), R is true
at the end of the same microcycle in which o_1 was initiated, while in
(6.19) R is not true at the end of o_1's microcycle, since o_2 cannot
begin execution until at least the following microcycle.
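
The difference between (6.18) and (6.19) can be made concrete with a
small cycle-counting sketch in Python. It assumes, for illustration
only, that every micro-operation is synchronous and occupies one
microcycle, that an SS statement packs its components into a single
microcycle, and that an LS statement gives each component its own
later microcycle. The class names `Op`, `Shseq` and `Lseq` and the
function `cycles` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Op:
    """A synchronous micro-operation assumed to fit in one microcycle."""
    name: str


@dataclass
class Shseq:
    """SS statement: components run in sequence within one microcycle."""
    parts: List["Stmt"]


@dataclass
class Lseq:
    """LS statement: each component starts in a later microcycle."""
    parts: List["Stmt"]


Stmt = Union[Op, Shseq, Lseq]


def cycles(stmt: Stmt) -> int:
    """Number of microcycles a statement occupies under the assumptions."""
    if isinstance(stmt, (Op, Shseq)):
        return 1
    if isinstance(stmt, Lseq):
        return sum(cycles(p) for p in stmt.parts)
    raise TypeError(f"unknown statement: {stmt!r}")


o1, o2 = Op("o1"), Op("o2")
print(cycles(Shseq([o1, o2])))  # R holds at the end of o1's own microcycle
print(cycles(Lseq([o1, o2])))   # R holds one microcycle later
```

The same postcondition R is reached in both cases; only the microcycle
at which it becomes true differs.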


6.4 Representative Examples


Given below are some examples of horizontal microprograms constructed
using the statements proposed above. Assertions about the machine
state and timing are inserted at appropriate points in a
microprogram, the sequence of these assertions providing a proof of
the microprogram's partial correctness. (2)

Examples 1-3 describe microprograms for instruction fetch ("IFETCH"),
and the interpretation of (a) a "storage-to-accumulator add" ("ADD")
and (b) a "zeroise storage" ("ZEROSTORE") instruction for a simple
microprogrammed machine designed originally by Rosin [59] and further
elaborated by Flynn [30]. In these three examples however, all
micro-operations are assumed to be synchronous. Example 4 repeats
IFETCH, assuming this time that main memory read (and write) is
asynchronous.

An explanation of the mnemonics used for these examples is given in
Fig. 6.3.

    SR     storage (memory buffer) register
    MAR    memory address register
    AC     accumulator
    IC     instruction counter
    IR     instruction register
    MM     main memory
           control (read only) memory
    MIC    microinstruction counter (control memory address register)
           microinstruction register
    REG    adder output register

Fig. 6.3
Explanation of Mnemonics for the Rosin/Flynn Machine

(2) Recall that a program is partially correct if it either produces
the desired result or fails to terminate. For further discussion of
partial and total correctness, the reader is referred to Manna [48]
and Owicki and Gries

Ls 




Example 1: IFETCH (Version 1)


.................... {MIC = a_2; IC = a_1; BEGIN CYCLE 1}
lseq
    MAR ← IC;
.................... {MAR = a_1; END CYCLE 1}
    dur SR ← MM[MAR]
    do INCR IC;
.................... {IC = a_1 + 1}
       nil;
       nil;
       nil;
       INCR MIC
.................... {MIC = a_2 + 1}
    end;
.................... {IC = a_1 + 1; MIC = a_2 + 1;
                      SR = MM[a_1] = i;
                      END CYCLE y, y ≥ 6}
    IR ← SR;
.................... {IR = i}
    cobegin
        MIC_{0-3} ← IR_{12-15};
        MIC_{4-9} ← 0
    coend;
end
.................... {MIC_{0-3} = IR_{12-15} = OPCODE(i);
                      MIC_{4-9} = 0;
                      IC = a_1 + 1; IR = i;
                      END CYCLE y + 2, y ≥ 6}

Example 2: ADD


.................... {MIC = a_1; IR_{0-11} = a_2;
                      ACC = d_2; BEGIN CYCLE 1}
lseq
    MAR ← IR_{0-11};
.................... {MAR = a_2; END CYCLE 1}
    dur SR ← MM[MAR]
    do nil;
       nil;
       nil;
       nil;
       INCR MIC
.................... {MIC = a_1 + 1}
    end
.................... {SR = MM[a_2] = d_1;
                      MIC = a_1 + 1;
                      END CYCLE y, y ≥ 6}
    shseq
        cobegin
            ADDLEFT ← SR;
            ADDRT ← AC
        coend
.................... {ADDLEFT = d_1; ADDRT = d_2}
        AC ← REG;
        goto IFETCH
.................... {AC = d_1 + d_2;
                      END CYCLE y + 2, y ≥ 6;
                      MIC = a_1 + 1;}
                      MIC = address of IFETCH


Note: In Flynn's description of Rosin's machine, the overall
microprocess "REG ← SR + ACC" is implemented in terms of two
micro-operations: "REG ← ADD ← AC" and "REG ← ADD ← SR", executed in
the same microcycle, where ADD here refers to the adder. The implied
transfers "ADD ← AC" and "ADD ← SR" are concurrent and take place at
the beginning of the microcycle, while the implied transfer
"REG ← sum of AC and SR" occurs at the end of the cycle.


Example 3: STOREZERO


.................... {AC = x; IR_{0-11} = a;
                      BEGIN CYCLE 1}
lseq
    REG ← ADD ← AC;
.................... {REG = x; END CYCLE 1}
    AC ← 0;
.................... {AC = 0; END CYCLE 2}
    cobegin
        MAR ← IR_{0-11}
        SR ← AC
    coend
.................... {MAR = a; SR = 0;
                      END CYCLE 3}
    dur MM[MAR] ← SR
    do AC ← REG;
.................... {AC = x}
       nil;
       nil;
       nil;
       nil
    end;
sreaedoere wisisietere se atene etshenclahegere ete sere | Sb) Gen =n Oe © menace 
* CYCLE y 2 8 


Wee ated va teusyae che Oe ee Se eae of ve | 


END CYCLE y +1; y > 8 


Note: The operation REG ← ADD ← AC by itself causes a straightforward
transfer of the contents of AC through the adder to REG.


Example 4: IFETCH (Version 2)


.................... {MIC = a_2; IC = a_1}
lseq
    MAR ← IC;
.................... {MAR = a_1; END CYCLE 1}
    dur SR ← MM[MAR]
    do INCR IC end;
.................... {IC = a_1 + 1; SR = MM[a_1] = i;
                      END CYCLE y, y ≥ 2}
    INCR MIC;
    IR ← SR;
.................... {MIC = a_2 + 1; IR = i;
                      END CYCLE y + 2, y ≥ 2}
    cobegin
        MIC_{0-3} ← IR_{12-15};
        MIC_{4-9} ← 0
    coend
end
.................... {MIC_{0-3} = IR_{12-15} = OPCODE(i);
                      MIC_{4-9} = 0;
                      IC = a_1 + 1; IR = i;
                      END CYCLE y + 3, y ≥ 2}


NOTE: In this version of IFETCH, notice that the timing assertions
are less specific than in Example 1. This is because of the EC
microstatement "dur SR ← MM[MAR] do INCR IC end". Clearly all that
can be said after encountering this statement is that its execution
will require at least 1 cycle. Since this version is based on the
assumption that "SR ← MM[MAR]" is asynchronous, this is the most
precise statement that can be made, unless we have some further
information, e.g., that the asynchronous operation will take more
than 3 cycles.

It is important to note however, that the asynchronicity of an
operation cannot be inferred from the construct.
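
The loss of timing precision described in this note can be sketched
in Python as a lower-bound calculation. The step list follows
Example 4 as far as it can be read here; treating every remaining
step as exactly one microcycle, and the names `Step` and
`lseq_bound`, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    label: str
    min_cycles: int      # fewest microcycles the step can occupy
    exact: bool = True   # False when an asynchronous operation is involved


def lseq_bound(steps: List[Step]) -> Tuple[int, bool]:
    """Return (lower bound on total microcycles, whether it is exact)."""
    total = sum(s.min_cycles for s in steps)
    exact = all(s.exact for s in steps)
    return total, exact


ifetch_v2 = [
    Step("MAR <- IC", 1),
    Step("dur SR <- MM[MAR] do INCR IC end", 1, exact=False),
    Step("INCR MIC", 1),
    Step("IR <- SR", 1),
    Step("cobegin ... coend", 1),
]

total, exact = lseq_bound(ifetch_v2)
print(total, exact)  # prints "5 False"
```

The bound is a statement of the form "at least 5 microcycles", which
matches the shape of the assertions in Example 4: a fixed offset added
to an unknown y with only a lower limit on y.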
Example 5 below implements a "multiply" instruction (MULT), and is
based on the machine structure, timing, and the multiplication
algorithm discussed by Husson [36, Section 2.4]. Note the use of a
polyphase timing scheme to facilitate the add operation within 1
microcycle.

A slight alteration to Husson's notation has been made in
constructing this example, viz., the BZ ("branch on AOB zero")
operation is represented here using the more convenient if..then
notation. The mnemonics are explained in Fig. 6.4.


    LA            Left input to the Adder
    RA            Right input to the Adder
    AOB           Adder output register
    R1, R2, R3    General purpose registers

Fig. 6.4
Explanation of Mnemonics for Husson's Machine


The final example, Example 6, describes the "RAL8" microprogram for
the Hewlett-Packard 2116 microprogrammed computer as specified by
Parnas and Siewiorek [52].




Example 5: MULT 


She tes c yale elevevelexercke eters) scnteiete RYo= x 2 0G) RZ =y S05) AOR =.0- 
— eageinya al 
lseq 
R3 < AOB; 
See ae TOO OOO OO or C8 USC e ce icme ins) tes Oe simisiay evecama ally 
shseq 
LA +*R1; 
Seeded eeewere: 6s Neve tenis earieie Viet Hee LA Saxe 
shseg 
ADD; 
Rees acheter eananere Se PSA eee eeeiCAO Buss) 


If AOB = 0 then EXIT 


end 
end 
end 
aratatere hh 3 een Meee ele XO Ro = y+ R303 
END CYCLE 2 
shseq ADD;


R3 < AOB 
end 
end 
Macieretere tere. Aisieboretaetane Sooper sat eats Sables 
ne OF A CYCES 
shseq 
ekestnocorepertas oe er Ore: {BEGIN A CYCLE} 
cobegin 
RAS eR: 
DA 
coend; 
shseq 
SUBTR; 
shseq 
Rl «+ AOB; 
ee eee ee ee oe {RIS OF 
If AOB # 0 then LOOP 
end 
end 
end [Loop] 


eee eR > 0 END Ore AsCYCLE, 


end 
eosere 082280802 8 © © @ eovoeenee e@ @ @ 4e = 0; R3 = Vt a 
Example 6: RAL8
Sige 1s ie'.06 01615) 6) 6 ares ees ss ele telerelere ces |A = XxX? START CYCLE GL 
I ear 
lseq 
MB < O; 
1 10} 


I < MB <iliere iMeles5 


De 
fe 
= 
5) 
i 


x } 


shseq 


Soe aro Gh nee eae ee hereteleteree ete eTe a, DV OMe eee 


end 


end 


eeeeeeeseee#e#rnQneee#e#eete8f @ eoeeeeecsge&e&eescrneees#eee#ee# {END GYCiE 4} 


shseq 


REVSOu clay 


{RBVS 2x} 


eoeeoeoeoeecsteeeeoeseseeteecseeeeee9ee COOL O EC) OOo a: Se 


shseq 


TBVS « RBVS x 2; 


A EP ees a ee oe ANAS 4x } 


A <« TBVS 


end 


{A =) 4x} 


066 6) 6 10 6 0 6 Oe 6) 0) 6 O06. 00) (81 91:8'9 0) e060 6 6. FS e).8 8 


end 
eae {END CYCLE 5} 
shseq
RBVS <+ A; 
PIT Patan eee ws ee cm he bce ee Per ae REVS Weed | 
shseq TBVS + RBVS x 2; 
SVantl gy Sraweletet sl Costes chet ethic tay steed ss mm | 
A + TBVS 
end 
= 08x} 
RBVS#=UP- SSBVSi<- 71
coend; 
shseq TBVS «+-RBVS +*SBVS; 
end
6.5 Conclusions


In this chapter I have proposed several constructs for expressing
structured, horizontal microprograms. Since it is possible to have
microinstructions in which micro-operations are sequentially
executed, for the sake of validation and understanding, such
micro-operations should be distinguished from concurrently executed
micro-operations. Both of these, in turn, have to be distinguished
from concurrency effects spanning several cycles. The constructs
discussed above serve to distinguish between these categories of
"horizontalness".

By associating certain axioms of execution with these statements,
assertions about the state of the machine can be made, and informal
proofs of microprogram correctness can be constructed. The importance
of this facility can hardly be overstated.

One of the key features that distinguish microprogramming from
"ordinary" programming is the relevance of timing constraints. Except
in the simplest machine structures, a time-independent description of
a microprogram is practically valueless. The constructs proposed here
not only permit relationships between operations over time to be
expressed, they also provide the useful facility of allowing
assertions to be made about timing. Such assertions may be used, for
example, in comparing microprograms for the degree of optimization
achieved.


As a final aspect of this discussion, returning to the problem of
microprogram translation, one should note that the LS microstatement
indicates explicitly to the translating system that the "source" code
is already in horizontal form and so the translator should not spend
time in attempting to detect parallelism within this code. As I
mentioned in Section 6.1, the programmer should also have the
facility of either partially optimizing a microprogram, or not
optimizing it at all. In either case, the simple microstatement

    begin o_1 ; o_2 ; ... ; o_n end        (6.20)

can be used, where o_i for 1 ≤ i ≤ n denotes either a micro-operation
or one of the microstatements defined above. Given a simple
microstatement, the translating system must complete the optimization
process using, for instance, the algorithms described in this thesis;
note that if some o_i happens to be one of the microstatements
described earlier, it will itself have been optimized by the
programmer, and can be treated as a single micro-operation in
subsequent mechanical optimization.
CHAPTER VII
CONCLUSIONS


7.1 Some Remarks on the Taxonomy of Microprogramming Systems


Within the established classification scheme of microprogrammed
control units [58], the focus of attention in the present work has
been the class of horizontal, polyphase systems. Within this class,
monophase schemes constitute a limiting subclass. But a
microprogramming system exhibits many of the attributes of a complete
computer system, and indeed, has often been conceptualized as an
"inner" computer [22]. From this viewpoint then, we obtain what is
essentially a special kind of parallel processing (inner) computer.

As I have remarked in Chapter I, parallel processing is a rather
broad concept and several classification schemes have been proposed
as convenient frameworks for categorizing machines [6,37,68]. One
well known and widely used taxonomy, due to Flynn [27,28], classifies
computers in terms of the amount of parallelism within the
instruction stream and/or the data stream. Note that in this context,
an instruction stream is simply a sequence of instructions executed
by a processing unit, and a data stream is a sequence of operands
that are fed to a processor.
By specifying single or multiple streams of instructions and data,
the following classes of systems are obtained:

(1) Single Instruction - Single Data Stream (SISD)
(2) Single Instruction - Multiple Data Stream (SIMD)
(3) Multiple Instruction - Single Data Stream (MISD)
(4) Multiple Instruction - Multiple Data Stream (MIMD)

The question is, within which of these categories does the horizontal
polyphase microprogramming system fall?

Suppose we designate the contents of a chunk of control memory by an
array:

    I_1  [I_11, I_12, I_13, ..., I_1p]
    I_2  [I_21, I_22, I_23, ..., I_2p]        (7.1)
     ...
    I_n  [I_n1, I_n2, I_n3, ..., I_np]

Here, each row, I_j, represents a microinstruction. I_jk (1) denotes
the micro-operation specified for execution from the k-th field of
I_j.

(1) Note that I_jk may be the null micro-operation, i.e. the
micro-operation that does nothing: a NO-OP.

At the microprogram level, since parallel effects are exhibited
between micro-operations and not microinstructions, it is the
micro-operation that bears analogy with the instruction at the
program level. Hence the sequence of micro-operations that are
executed from any one of the columns of the array (7.1) bears analogy
with an "instruction stream": this sequence of micro-operations is
routed to a particular part of the machine data flow which is then
appropriately activated (see Section 2.1). What we obtain then is a
multiple instruction stream situation.
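
The reading of array (7.1) by columns can be sketched in Python. The
3 x 3 chunk of control memory below is hypothetical: the
micro-operation names are borrowed from the earlier examples purely as
labels, their placement into fields is made up, and `None` stands in
for the NO-OP. The function name `field_streams` is likewise an
assumption.

```python
from typing import List, Optional

NO_OP = None  # stands in for the null micro-operation

# Rows are microinstructions, columns are fields (hypothetical layout).
control_memory: List[List[Optional[str]]] = [
    ["MAR<-IC", "INCR MIC", NO_OP],
    ["SR<-MM[MAR]", NO_OP, "INCR IC"],
    ["IR<-SR", "INCR MIC", NO_OP],
]


def field_streams(array: List[List[Optional[str]]]) -> List[List[Optional[str]]]:
    """Return one list per column: the stream issued from each field."""
    if not array:
        return []
    width = len(array[0])
    if any(len(row) != width for row in array):
        raise ValueError("all microinstructions must have the same fields")
    return [[row[k] for row in array] for k in range(width)]


for k, stream in enumerate(field_streams(control_memory), start=1):
    print(f"field {k}:", stream)
```

Each printed list is one "instruction stream" in the sense used above,
and there are as many of them as there are fields in a
microinstruction.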

Classification of the data stream is however, not so easily obtained.
For, at the program level, a multiple data stream is unequivocally
exemplified by a sequence of vectors in which the vector elements
bear no relation to one another, and corresponding elements of
successive vector operands constitute a data stream. This is seen,
for example, in the case of ILLIAC IV [7], which is an SIMD system.
In the MIMD class of systems, instances of multiple data streams are,
in addition to vectors, data for concurrent, independent tasks, as in
the case of parallel evaluation of arithmetic expressions [53] or
concurrent execution of independent processes in a speech recognition
system. The latter is one of the main applications envisaged for the
Carnegie-Mellon University Multiprocessor [10].

In all these cases, multiplicity of the data, or rather the mutual
separateness of the data fed to the separate processing units, is
evident. At the microprogram level however, it is difficult to
conceive of the operands to the j-th micro-operation of a
microinstruction as not being closely related to the operands for the
k-th micro-operation (j ≠ k). Rather, the operands for these
different micro-operations seem to form a single, meaningful data
item. For, the input data to a microprogram (which is interpreting
some program-level instruction) are presented by the contents of some
words in main memory, the contents of the registers within the data
flow, and possibly, the partial contents of some of the control
memory words. This entire collection - which is in fact a component
of the machine state - constitutes a single data entity that is
merely fragmented and distributed to the various parts of the data
flow.

The sequence of machine states corresponding to the execution of a
sequence of microinstructions is thus the closest analogue we can
identify to a data stream, and there is only one such stream
corresponding to an "inner" computer. One may conclude therefore,
that a horizontal microprogramming system approximates most closely
an MISD machine.


7.2 Plans for Future Work


The principal results of this study can be summarized as follows:

(1) Development of the notion of potential parallelism and its use in
constructing polyphase timing schemes and in the minimization of
control memory word lengths.

(2) An optimizing algorithm for the detection of parallelism in
straight-line microprograms.

(3) Analysis of loop-free canonical microprograms and the
construction of a method for identifying parallel micro-operations in
such microprograms.

(4) The design of a set of language constructs for representing
horizontal microprograms.

As extensions to this work, there are in particular two rather
important and promising areas for study:

(A) Implementation of the proposed parallelism-detection algorithms
with respect to commercially available microprogrammable machines. It
may be noted in passing that while the wider context within which
these algorithms are relevant is the design and implementation of
high level microprogramming languages, the algorithms can be
implemented as an independent processing system.

One of the problems that the implementer must face is that of
representation; more precisely, the algorithms assume that
micro-operations are represented in the form of 5-tuples
<OP,SC,SK,U,V>. For the particular machine being used for
implementation, these distinct micro-operations must therefore be
individually identified.
This task is not as formidable as it may seem. For instance, I have
recently begun a program of study in which, as a first step,
micro-operations for the Varian 75 [79] were identified (there are,
surprisingly, fewer than 150 of them) and converted into the form of
5-tuples. Using this representation, the Jackson-Dasgupta algorithm
has been implemented for the Varian system.

Implementation of these algorithms will certainly provide a powerful
support feature for microprogramming and emulation. It will also
provide a means of experimentation. A particularly interesting range
of questions I would like to see answered is: given a machine
structure and control memory organization, to what extent will the
average degree of actual parallelism (i.e., the average number of
micro-operations/microinstruction) be affected by changing from a
local, non-optimizing algorithm (e.g., the JD algorithm) to a local
optimizing algorithm (Alg. 4.5) and then to the global method
(Alg. 5.3)? Will the average degree of parallelism be bounded within
rather narrow limits or will there be significant differences? How
does the machine instruction type influence the degree of
parallelism? And finally, what will be the overheads incurred in
global optimization?
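
The measure named in these questions, the average number of
micro-operations per microinstruction, is simple to compute once a
microprogram has been packed into microinstructions. The Python sketch
below is a hedged illustration: the two packings are invented data,
not output of any algorithm in this thesis, and the function name
`average_parallelism` is hypothetical. `None` is used for a NO-OP
field and is not counted.

```python
from typing import List, Optional, Sequence


def average_parallelism(microprogram: Sequence[Sequence[Optional[str]]]) -> float:
    """Average number of non-null micro-operations per microinstruction."""
    if not microprogram:
        raise ValueError("microprogram must contain at least one microinstruction")
    counts = [sum(1 for op in mi if op is not None) for mi in microprogram]
    return sum(counts) / len(microprogram)


# The same five micro-operations under two hypothetical packings.
packing_a: List[List[Optional[str]]] = [["a"], ["b", "c"], ["d"], ["e"]]
packing_b: List[List[Optional[str]]] = [["a", "d"], ["b", "c", "e"]]

print(average_parallelism(packing_a))  # 1.25
print(average_parallelism(packing_b))  # 2.5
```

Running such a measurement over the output of each algorithm for the
same source microprograms would give the comparison asked for above; a
dynamic variant would weight each microinstruction by how often it is
executed.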

As far as I know, the only published work where micro-parallelism has
been investigated empirically is the report by Barr et al. [8], who
found that, for the particular machine under study - the Argonne
Microprocessor (AMP) - the average degrees of parallelism for
problem-oriented microcode (for a graphics application system) and
for microcode interpreting a conventional set of machine instructions
were not significantly different. This was however, only a static
analysis. There is quite evidently, much scope for further study.

(B) The second area relates to the constructs proposed in Chapter VI.
These constructs constitute a contribution to the design of
microprogramming languages and will, in fact, form some of the basic
elements in the design of a language currently being planned by this
author. In addition, the constructs provide, as I have demonstrated,
a representational basis for validating microprograms. What seems
immediately necessary is the application of these constructs to
microprograms written for some actual machine, in order to explore
their adequacy with respect both to representation and to proving
microcode correctness.

At the time of writing, microprogramming seems to have reached some
sort of a crossroad at which its "significance" is being critically
assessed [61]. Rosin's concept of the "reasonable" machine and his
contention that microprograms serve to construct a reasonable
superstructure on an unreasonable base are well worth considering as
a novel and useful notion.

An "unreasonable" machine in this context is one which requires the
programmer to grapple with particular gates, buses, race conditions,
split cycle memories and other such hardware features - in fact
precisely those features that the microprogrammer is presently coping
with. Further implications of the concept are that, if reasonable
base machines are built (and according to Rosin, they can be), then
microprogramming will lose its raison d'etre, and hence will not be
necessary.

While I find Rosin's concept of the reasonable machine and its
realization a useful one, I feel that the fact that microprogramming
serves to disguise the unreasonable hardware from the user is a
positive attribute of microprogramming rather than a negative one as
he implies. At least, as long as we are unable to build completely
reasonable base machines, microprogramming will continue to
participate rather significantly in the creation of reasonable
virtual machines.

But of course, unless reliable and efficient firmware is guaranteed,
large scale use of the technique may simply lead to the layering of
one unreasonable machine on top of another. The danger of this has
been pointed out quite clearly by Lehman [46]. The work reported in
this thesis will, I hope, contribute to the catalogue of ideas and
techniques that will help in constructing reasonable, efficient, and
reliable virtual machines.